Operating  System  Support  for  Shared  Hardware  Data  Structures 


by  Gedare  Bloom 

B.S.  in  Computer  Science  and  Mathematics,  May  2005,  Michigan  Technological  University 
M.S.  in  Computer  Science,  August  2012,  The  George  Washington  University 


A  Dissertation  submitted  to 


the  Faculty  of 

School  of  Engineering  and  Applied  Science 
of  The  George  Washington  University 
in  partial  satisfaction  of  the  requirements 
for  the  degree  of  Doctor  of  Philosophy 


January  31,  2013 


Dissertation  directed  by 
Bhagirath  Narahari 

Professor of Engineering and Applied Science and of Engineering Management & Systems Engineering

and 

Rahul  Simha 

Professor  of  Engineering  and  Applied  Science 


Report  Documentation  Page 


Form  Approved 
OMB  No.  0704-0188 




1. REPORT DATE: 31 JAN 2013

3. DATES COVERED: 00-00-2013 to 00-00-2013

4. TITLE AND SUBTITLE: Operating System Support for Shared Hardware Data Structures

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): The George Washington University,
Faculty of School of Engineering and Applied Science, Washington, DC, 20052

12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution unlimited

14.  ABSTRACT 

A  fundamental  problem  in  computing  is  that  processors  cannot  access  memory  fast  enough  to  stay  fully 
utilized.  Architecture  features  like  cache,  prefetching,  out-of-order  execution,  and  multiprocessing  only 
benefit  software  with  temporal  or  spatial  locality,  or  instruction-level  or  task-level  parallelism.  Software 
that  relies  on  fine-grained  access  to  data  with  structural  locality,  such  as  pointer-based  data  structures, 
derives  little  benefit  from  these  features.  The  importance  of  these  data  structures  motivates  a  new 
approach  to  improve  memory  performance.  A  hardware  data  structure  (HWDS)  implements  a  data 
structure  with  operations  that  leverage  parallelism  and  structural  locality  to  reduce  data  structure  access 
times,  but  only  supports  an  exclusive  data  structure  small  enough  to  fit  the  capacity  of  the  HWDS.  This 
thesis  proposes  operating  system  (OS)  support  for  HWDSs  so  that  applications  can  use  and  share  a  HWDS 
even when its capacity is less than the data structure’s size.

16. SECURITY CLASSIFICATION OF: a. REPORT unclassified; b. ABSTRACT unclassified;
c. THIS PAGE unclassified

17. LIMITATION OF ABSTRACT: Same as Report (SAR)

18. NUMBER OF PAGES: 137

Standard Form 298 (Rev. 8-98)
Prescribed by ANSI Std Z39-18

The  School  of  Engineering  and  Applied  Science  of  The  George  Washington  University 
certifies  that  Gedare  Bloom  has  passed  the  Final  Examination  for  the  degree  of  Doctor  of 
Philosophy  as  of  November  19,  2012.  This  is  the  final  and  approved  form  of  the  dissertation. 

Operating  System  Support  for  Shared  Hardware  Data  Structures 

Gedare  Bloom 


Dissertation  Research  Committee: 

Bhagirath  Narahari,  Professor  of  Engineering  and  Applied  Science  and  of  Engineering 
Management  &  Systems  Engineering,  Dissertation  Co-Director 

Rahul  Simha,  Professor  of  Engineering  and  Applied  Science,  Dissertation  Co-Director 
Gabriel  Parmer,  Assistant  Professor  of  Computer  Science,  Committee  Member 
Evan  Drumwright,  Assistant  Professor  of  Computer  Science,  Committee  Member 
Guru  Prasadh  Venkataramani,  Assistant  Professor  of  Engineering  and  Applied  Science, 
Committee  Member 




©  Copyright  2013  by  Gedare  Bloom 
All  Rights  Reserved 


Dedication 


For  my  grandfather  Laird,  who  inspired  me  to  seek  higher  education. 


For  my  wife  Veronica,  who  inspires  me  to  better  myself. 


For  my  daughter  Annalise,  who  inspires  me  to  better  the  world. 




Acknowledgments 


Thank  you  Veronica,  for  challenging  me  to  improve  in  all  aspects  of  life.  I  love  you. 


No  one  is  an  island.  I  am  grateful  for  all  the  assistance  I  have  received  throughout  my 
life.  A  dissertation  is  the  culmination  of  a  long  journey,  an  epic  quest  of  self-discovery  that 
starts  when  the  young  mind  is  planted  with  the  seed  of  introspection.  Along  the  way,  many 
hands  help  to  sow  the  seed  and  till  its  soil,  and  to  the  people  whose  hands  have  helped  me, 
I  am  grateful. 

To  my  parents,  Uno  and  Jodi,  for  encouraging  academic  success,  praising  hard  work 
and  good  grades,  permitting  my  obsessive  reading,  and  for  chasing  their  dreams.  To  my 
siblings,  Jeni  and  Adam  for  enduring  and  passing  lessons  learned,  and  to  Ric  and  Ben  for 
following  and  providing  me  with  retrospective.  To  my  grandparents,  Laird  and  Marcia 
Heal,  Elsa  Bloom,  and  Beulah  Huff,  whose  memories  I  treasure,  for  instilling  in  me  the 
virtue  of  being  studious,  wholesome,  and  hard-working:  mens  sana  in  corpore  sano. 

To my aunts and uncles: to Dicky for embodying sisu; to Sandy Martin for introducing
me to science; to Kathy Soderbloom for a larger world of politics and religion; to Diana
Anderson for intellectual challenges and inspirations; to David, Andy, and Loren Heal for
introducing the digital world to me; to Bud Heal for maintaining some of the old world;
and to Kim Rosser, who always seems positive to me.

To  all  the  wonderful  teachers  and  professors  who  are  there  for  their  students. 

To  Mrs.  Weber,  my  5th  grade  science  teacher  who  first  introduced  me  to  controlled 
scientific  experimentation. 

To Mr. Wang, my 7th and 8th grade math teacher, for seeing in me a skill for math
and encouraging me to develop it beyond the course material, and for the occasional
pick-up basketball game, in which I got to socialize with a teacher outside the confines
of the classroom, a new development.

To Mr. Stelmaszak, my affable but demanding Calculus teacher, for encouraging
students to think about and prepare for the future.

To  Mr.  Kedigh,  for  introducing  me  to  programming  and  computer  science. 

To Dr. Dave Poplawski, for sponsorship of the student ACM and programming
competitions at Michigan Tech.

To  Dr.  Steve  Seidel,  for  introducing  me  to  the  world  of  research  and  academe  through 
the  MTU  UPC  seminar. 

To  Dr.  Soner  Onder,  for  teaching  me  enough  of  compilers  and  architecture  that  I  have 
hardly  needed  a  book  or  refresher  since,  a  truly  amazing  skill  of  a  great  teacher;  I  will 
always  remember  that  I  “cannot  bribe  God.” 

To  Dr.  Abdou  Youssef,  for  being  an  inspiration  both  in  the  classroom  and  out. 

To  Dr.  Poorvi  Vora,  for  your  passion  for  students  and  teaching. 

To  Dr.  Jonathan  Stanton,  for  introducing  me  to  the  world  of  systems  and  some  of  the 
realities  of  academic  life. 

To  my  advisors,  Bhagi  and  Rahul,  for  taking  me  under  wing  and  giving  me  the  freedom 
to  explore. 

To  Stefan  Popoveniuc,  for  being  a  great  sounding  board  and  working  with  me  on  my 
first  paper. 

To  Eugen  Leontie,  for  being  in  the  trenches  with  me;  our  successes  have  been  great  and 
I  am  glad  to  have  worked  with  you. 

To  Joe  Zambreno,  for  useful  advice  about  my  career  and  research.  You  have  helped  me 
to  see  the  world  through  a  different  lens. 




To  Gabe  Parmer,  for  your  excitement  and  input  about  my  research.  Our  conversations 
about systems have been great for my intellectual growth.

To  Guru,  you  endorsed  my  work  when  I  was  uncertain  in  the  early  stages,  which  helped 
me  to  stay  positive  and  on  track. 

Friends  have  a  lot  to  do  with  how  a  mind  is  shaped  and  grows.  Some  friends  grow  and 
learn  with  you,  teach  you,  and  inspire  you  to  work  hard:  Dan  Mayo,  thank  you  for  being 
such a friend; I am better for having known you.

To  Rob  Weller,  Brandon  Wilson,  and  David  Deane,  for  the  boring  nights  and  the 
exciting,  for  inviting  me  out  despite  my  proclivity  toward  unpredictability  and  wildness. 

To  Dan  Clark,  Justin  Ter  Avest,  and  Adam  Shirey  for  8.31  and  all  the  rest. 

To  Joe  Vaillancourt,  for  helping  to  drag  me  along  at  times. 

To  Nick  Young  and  Jeremy  Koenen,  for  broomball,  sake,  and  puyo  puyo. 

To my friends in grad school who helped lessen some of the burdens of graduate
student life, thank you Darby Thompson, Rim Yazigi, Darakshan Mir, Kevin Henry, Amin
Teymorian,  Liran  Ma,  Kerry  McKay,  and  Olga  Gelbart. 

To  my  lab  mates  Scotty  Smith  and  James  Marshall,  for  attending  more  than  their  share 
of  presentations  on  my  work,  and  to  James  for  encouraging  me  to  ride  my  bicycle  more, 
and  Scotty  for  putting  up  with  our  bicycle  talk. 

To  the  great  folks  who  work  with  RTEMS,  especially  Joel  Sherrill,  Chris  Johns,  and 
Sebastian  Huber,  for  the  support  over  the  past  few  years. 

To  Gary  Kreger  and  Curtis  Schoolman,  for  teaching  me  that  working  hard  is  just 
working. 

To  my  department  and  university,  and  to  industry  and  government  for  funding  my 
research  directly  and  indirectly  through  grants,  fellowships,  and  awards.  My  academic 
career so far has been supported in part by Hewlett-Packard (through the MTU UPC), the
US National Science Foundation (NSF grants CNS-1117243, CNS-0934725, ITR-025207,
CNS-0831149), the Air Force Office of Scientific Research (AFOSR grants FA9550-09-1-0194
and FA955006-1-0152), and the George Washington University (teaching fellowships,
travel  awards,  and  summer  dissertation  fellowship). 

To  the  rest,  for  surely  I  missed  some,  I  give  thanks. 


Surely  there  must  be  a  less  primitive  way  of  making  big  changes  in  the  [memory] 
store  than  by  pushing  vast  numbers  of  words  back  and  forth  through  the  von 
Neumann  bottleneck.  Not  only  is  this  tube  a  literal  bottleneck  for  the  data 
traffic  of  a  problem,  but,  more  importantly,  it  is  an  intellectual  bottleneck  that 
has  kept  us  tied  to  word-at-a-time  thinking  instead  of  encouraging  us  to  think 
in  terms  of  the  larger  conceptual  units  of  the  task  at  hand.  Thus  programming 
is  basically  planning  and  detailing  the  enormous  traffic  of  words  through  the 
von  Neumann  bottleneck,  and  much  of  that  traffic  concerns  not  significant  data 
itself  but  where  to  find  it. 


—  John  Backus,  1977 

Advances in microelectronics have made the realization of “smart” data structures
a practical reality.

—  Charles  Leiserson,  1979 

Indeed,  I  believe  that  virtually  every  important  aspect  of  programming  arises 
somewhere  in  the  context  of  sorting  or  searching! 




Don  Knuth 


Sisu. 




Abstract  of  Dissertation 


Operating  System  Support  for  Shared  Hardware  Data  Structures 

A  fundamental  problem  in  computing  is  that  processors  cannot  access  memory  fast 
enough  to  stay  fully  utilized.  Architecture  features  like  cache,  prefetching,  out-of-order 
execution,  and  multiprocessing  only  benefit  software  with  temporal  or  spatial  locality,  or 
instruction-level  or  task-level  parallelism.  Software  that  relies  on  fine-grained  access  to 
data  with  structural  locality,  such  as  pointer-based  data  structures,  derives  little  benefit 
from  these  features.  The  importance  of  these  data  structures  motivates  a  new  approach 
to  improve  memory  performance.  A  hardware  data  structure  (HWDS)  implements  a  data 
structure  with  operations  that  leverage  parallelism  and  structural  locality  to  reduce  data 
structure  access  times,  but  only  supports  an  exclusive  data  structure  small  enough  to  fit  the 
capacity  of  the  HWDS.  This  thesis  proposes  operating  system  (OS)  support  for  HWDSs 
so  that  applications  can  use  and  share  a  HWDS  even  when  its  capacity  is  less  than  the 
data  structure’s  size. 

The  priority  queue  and  map  data  structures  demonstrate  the  appeal  of  an  OS  HWDS 
union. A GPS benchmark with real-world data executes 24% faster using a HWDS instead
of a software data structure, even though the data exceeds the hardware’s capacity.
Compared to software implementations, a 128-node HWDS achieves over 50% faster mean
access  time  to  a  512-node  priority  queue,  and  15%  faster  mean  search  time  in  a  512-node 
read-mostly  map.  When  sharing  a  HWDS  among  four  maps  of  power-of-2  sizes  between 
64  and  512,  a  128-node  HWDS  achieves  35%  faster  searches  than  a  splay  tree.  These 
performance  improvements  are  made  possible  by  the  OS  support  for  HWDSs  proposed  in 
this thesis.

Glossary of Terms


container:  An  abstract  data  type  in  the  C++  STL. 

exception-based  HWDS:  A  HWDS  that  permits  direct  access  but  raises  exceptions 
when  the  HWDS  cannot  satisfy  a  request.  See  also:  interposition-based,  HWDS. 

heap: A data structure containing key-value pairs that orders nodes within a tree according
to a rule that a parent node’s key is less than or equal to (for a min heap) or greater than
or equal to (for a max heap) its child nodes’ keys. See also: priority queue.

HWDS  assignment:  Problem  of  determining  whether  a  data  structure  uses  a  HWDS  or 
a  software  implementation. 

HWDS  context:  HWDS  registers  and  data  associated  with  a  data  structure.  See  also: 
HWDS context switch.

HWDS  context  switch:  Saving  one  HWDS  context  and  restoring  another.  See  also: 
HWDS  context. 

interposition-based  HWDS:  A  HWDS  that  is  accessed  through  a  software  library  which 
avoids  making  invalid  requests  to  the  HWDS  by  checking  every  access.  See  also: 
exception-based  HWDS. 

locality:  The  tendency  of  memory  accesses  to  occur  in  clusters.  See  also:  spatial  locality, 
structural locality, temporal locality.

map: A data structure that contains key-value pairs and supports an efficient mechanism
to look up (search) nodes by key. Also known as: associative array, dictionary, or
search tree.

multitasking: OS-mediated processor sharing for multiple execution contexts. See also:
scheduler,  task,  thread. 

node:  A  storage  unit  for  a  data  structure  comprising  one  or  more  data  and  link  (pointer) 
fields. 

priority  queue:  A  data  structure  that  contains  key-value  pairs  sorted  by  a  priority  stored 
in  the  key. 

red-black  tree:  A  balanced  tree  data  structure  named  for  the  node  coloring  rules  that 
ensure a bounded height imbalance. See also: map.

scheduler:  Entity  that  controls  access  to  hardware  resources.  Commonly  used  for  sharing 
processor  time  or  access  to  devices. 

simultaneous multithreading: Hardware-supported processor sharing for multiple execution
contexts running in parallel. See also: thread, multitasking.

skip list: A list-of-lists data structure that stores all nodes in the last (bottom) list and
assigns each node a randomized number of links (height). See also: map.

spatial locality: Tendency of memory accesses to be located near each other in the memory
address space. See also: locality.

splay  tree:  A  self-adjusting  binary  search  tree  named  for  the  splay  operation,  which  moves 
recently  accessed  nodes  to  the  root  for  faster  access.  See  also:  map. 

split  HWDS:  HWDS  that  uses  an  overflow  data  structure  which  ignores  the  mechanisms 
of  the  HWDS.  See  also:  united  HWDS. 

stable: A property of a priority queue or map data structure whereby nodes with the same
key are dequeued in FIFO order.

structural locality: Tendency of memory accesses to follow an ordered pattern. See also:
locality.

task:  A  schedulable  software  execution  context.  Also  known  as:  thread  or  process. 

temporal  locality:  Tendency  of  recent  memory  accesses  to  recur.  See  also:  locality. 

thread: A hardware execution context. See also: simultaneous multithreading, task.

united  HWDS:  HWDS  that  uses  an  overflow  data  structure  which  relies  on  the  HWDS 
to  improve  performance.  See  also:  split  HWDS. 

Zipf’s  distribution:  A  skewed  probability  distribution  generated  with  Zipf’s  law,  which 
states that the probability that the i’th key out of n keys will be accessed is inversely
proportional to i.
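
As a worked form of the Zipf entry above, the proportionality can be turned into a probability by normalizing over all n keys. The glossary does not state the normalizing constant; the expression below assumes the plain form of Zipf's law (exponent 1), where H_n denotes the n-th harmonic number.

```latex
P(i) \;=\; \frac{1/i}{\sum_{k=1}^{n} 1/k} \;=\; \frac{1}{i \, H_n},
\qquad i = 1, \dots, n,
\qquad H_n = \sum_{k=1}^{n} \frac{1}{k}.
```

For example, with n = 4 keys, H_4 = 25/12, so the most popular key is accessed with probability 12/25 = 0.48 and the least popular with probability 3/25 = 0.12.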
Chapter 1: Introduction


Throughout  the  history  of  computing,  processors  have  outperformed  main  memory  [130]. 
Indeed, the performance gap has steadily increased since the 1980s, leading Wulf and McKee
[136] to coin the term memory wall to describe the bottleneck caused by the gap. The
memory  wall  arises  from  processor  performance  improving  faster  than  memory  bandwidth 
and  latency. 

One technique to delay the impact of the memory wall is caching. But even with
an infinite-size cache that (pre)fetches data at full memory bandwidth, the gap between
processor speed and bandwidth means cache misses are inevitable: data cannot move
into the cache fast enough to satisfy the processor. When the cache misses, the
memory access time depends on latency to get the first byte, and bandwidth to get the
rest. Patterson [99] states that latency lags bandwidth: a historical trend indicates that
latency improves more slowly than bandwidth. Yet latency dominates bandwidth in determining
the performance of memory accesses for small sizes, such as a cache line. Poor memory
latency means that cache misses become more expensive relative to processor cycle times
as technology improves. Ten years ago, a 1 GHz processor with DDR-200 RAM had a
memory latency around 52 CPU cycles. Five years ago, a 4 GHz processor with DDR2-800
RAM had a memory latency around 220 CPU cycles.
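
The two data points above can be compared in absolute terms by converting cycles to nanoseconds. The helper below is only a small arithmetic sketch; the function name latency_ns is an illustrative assumption and does not come from the dissertation.

```c
#include <assert.h>

/*
 * Convert a memory latency given in CPU cycles to nanoseconds.
 * A clock of f GHz completes f cycles per nanosecond, so the
 * latency in nanoseconds is cycles / f.
 * (latency_ns is a hypothetical helper name used for illustration.)
 */
double latency_ns(double cycles, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz;
}
```

Calling latency_ns(52, 1.0) gives 52 ns and latency_ns(220, 4.0) gives 55 ns. The absolute latency is roughly unchanged between the two systems, while the cost of a miss measured in processor cycles grows by a factor of about 4.2 (220/52), which is the effect the paragraph describes.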

Meanwhile Moore’s law abides: a prediction that a new chip can be produced with
double the transistors (and thus potential performance) compared with chips made less
than two years prior. As transistor density increases, power and heat dissipation have
become critical factors in chip design and manufacture. The answer from the architecture
community has been the chip multiprocessor, or multicore: excess transistors are devoted
either to increased cache or to more processing cores. A fundamental assumption of
multicore is that applications can or will exploit sufficient parallelism among multiple
cores to achieve speedup. Unfortunately, parallel programming remains hard, despite years
of research that has yielded promising technologies such as transactional memory [54] and
lock-free data structures [32]. While multicore processors delay the growing gap between
latency and performance by processing at lower frequencies, latency still dominates
bandwidth, and the memory wall remains.

Scaling  the  memory  wall  drives  research  in  both  computer  architecture  and  compilers. 
Computer architects introduced hardware prefetching to reduce miss rates, and techniques
to hide cache misses when sufficient work is available, such as non-blocking caches,
out-of-order execution, and simultaneous multithreading. Compilers play a role in controlling
how  software  accesses  the  cache  and  can  reduce  miss  rates  using  techniques  such  as  software 
prefetching,  instruction  reordering,  memory  compaction,  and  loop  optimizations.  Most 
compiler  solutions  work  well  on  statically  known  or  easily  profiled  applications  such  as 
software  with  bounded  loops  and  fixed-size  arrays.  But  many  high-level  programs  are 
written  in  terms  of  data  structure  (or  object)  operations  and  interfaces,  and  not  in  terms 
of  loops  and  arrays. 

This thesis improves the state of the art by supporting the use of excess transistors
to improve application performance through a fundamental programming construct that
spans both processor and memory: the data structure. A hardware data structure (HWDS)
is an implementation of a data structure that is supported by hardware mechanisms to
improve data structure operations. By organizing the memory hierarchy in terms of data
structure operations, instead of cache line fetches, HWDSs permit rethinking how
processors access memory. More importantly, hardware mechanisms exploit parallelism to
reduce the algorithmic complexity of data structure operations, which can yield substantial
performance benefits compared with software implementations; see Figure 1-1, which shows
how hardware can insert to a sorted structure faster than software because of the advantage
of parallel comparisons.

Figure 1-1: The advantage of hardware is parallelism. Here, an insert in a software binary
search tree requires traversing at most the entire depth of the tree, whereas hardware can
insert in two steps by broadcasting and comparing the new value in parallel.
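
The two-step hardware insert described for Figure 1-1 can be modeled in software. The following is a minimal sketch that assumes a small sorted array of hardware cells; the names hw_sorted_array, hw_insert, and HW_CAPACITY are hypothetical and are not taken from any HWDS design in this thesis. The loops only simulate sequentially what the hardware cells would do at the same time.

```c
#include <assert.h>

#define HW_CAPACITY 8  /* hypothetical number of hardware cells */

/* Hypothetical model: cells[0..count-1] hold keys in ascending order. */
struct hw_sorted_array {
    int cells[HW_CAPACITY];
    int count;
};

/* Insert key; returns the cell index used, or -1 if the array is full. */
int hw_insert(struct hw_sorted_array *hw, int key)
{
    int greater[HW_CAPACITY];
    int i;

    assert(hw->count >= 0 && hw->count <= HW_CAPACITY);
    if (hw->count == HW_CAPACITY)
        return -1;  /* a full structure cannot accept new nodes */

    /* Step 1: broadcast the key; in hardware every occupied cell
     * compares its own key against it at the same time. */
    for (i = 0; i < hw->count; i++)
        greater[i] = hw->cells[i] > key;

    /* Step 2: every cell holding a larger key shifts one slot to the
     * right, and the new key lands in the slot that opens up. Equal
     * keys do not shift, so the new key follows older equal keys. */
    for (i = hw->count; i > 0 && greater[i - 1]; i--)
        hw->cells[i] = hw->cells[i - 1];
    hw->cells[i] = key;
    hw->count++;
    return i;
}
```

Because the cells are kept sorted, the comparison results form a run of zeros followed by a run of ones, so the shift in step 2 touches exactly the cells that reported a larger key. In this software model each step costs time linear in the number of cells; the point of the hardware version is that each step completes in constant time regardless of how many cells are occupied.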

HWDSs are not without disadvantages, however, most of which stem from limited hardware
resources. Chip space allocated to the HWDS steals from other features such as cache
and on-chip communications, so minimizing the HWDS size is important. The main
disadvantage of HWDSs is the limited hardware capacity that can be devoted to supporting
data structure operations; see Figure 1-2a. Limited hardware capacity precludes using one
HWDS for each software data structure, so sharing the HWDS resources in multitasking
environments is important; see Figure 1-2b, which depicts two data structures attempting
to use a HWDS simultaneously. The HWDS context is the set of control registers and data
belonging to the data structure that is loaded in a HWDS.
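
To make the notion of a HWDS context concrete, the sketch below models a context as a plain C structure and a context switch as saving one context and loading another. This is an illustration under stated assumptions, not the register layout or switching mechanism of any real HWDS: the structure fields, the names hwds_context, hwds_device, and hwds_context_switch, and the capacity of four nodes are all hypothetical.

```c
#include <assert.h>
#include <string.h>

#define HWDS_CAPACITY 4  /* hypothetical capacity, in nodes */

/* Hypothetical node layout: one key-value pair. */
struct hwds_node {
    int key;
    int val;
};

/* Hypothetical context: control state plus the data held in the HWDS. */
struct hwds_context {
    int ds_id;                              /* owning data structure */
    int size;                               /* number of valid nodes */
    struct hwds_node nodes[HWDS_CAPACITY];  /* node data in the HWDS */
};

/* The simulated device holds exactly one loaded context at a time. */
struct hwds_device {
    struct hwds_context loaded;
};

/* Save the currently loaded context into *save_area, then load *next. */
void hwds_context_switch(struct hwds_device *dev,
                         struct hwds_context *save_area,
                         const struct hwds_context *next)
{
    assert(save_area != next);
    assert(next->size >= 0 && next->size <= HWDS_CAPACITY);
    memcpy(save_area, &dev->loaded, sizeof *save_area);
    memcpy(&dev->loaded, next, sizeof dev->loaded);
}
```

The cost of such a switch grows with the amount of data in the context, which is one reason the size of the HWDS and the policy for sharing it matter in a multitasking environment.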

Hardware support for specific data structures has been proposed in the past (see
Chapter 2), but so far the interface between the HWDS and programmer has been ignored.
Most existing HWDSs have limited interactions with operating system (OS) and application
software, with much of the prior work allowing only one data structure with a known
maximum size (less than HWDS capacity) to use the HWDS; a notable exception is the
work of Chandra and Sinnen [27], which is reviewed in Section 2.1 and compared with the
approach of this thesis in Section 4.3.2. Sharing a HWDS among arbitrarily sized data
structures requires extra support in both hardware and software.

Figure 1-2: Limited hardware resources create disadvantages for HWDSs.
(a) A full HWDS cannot accept new nodes. (b) Data structures cannot share the HWDS.

This  thesis  shows  that  OS,  application,  and  HWDS  interactions  are  crucial  to  realizing 
efficient HWDSs that arbitrarily sized data structures can share. Architecture features
enable OS and application use of HWDSs. OS support extends the capabilities of HWDSs
beyond prior art with support for arbitrarily sized structures and sharing a HWDS among
tasks.  (Throughout  this  dissertation,  task  denotes  a  software  context  and  thread  denotes 
a  hardware  context.)  HWDSs  can  also  improve  the  performance  of  OS  data  structure 
operations,  and  contribute  knowledge  about  task  behavior  with  respect  to  data  structure 
usage. 

Yesterday’s  data  structures  were  written  together  with  application  code.  Today’s  data 
structures  come  in  optimized,  portable,  mature  libraries.  Tomorrow’s  data  structures 
should  ship  with  the  hardware  support  to  use  them  well.  This  thesis  shows  the  promise  of 
HWDSs as a new interface between software and memory.

1.1 Impact


Niklaus Wirth wrote that “Algorithms + Data Structures = Programs,” a maxim that has
gained strength as software has become more complex and data structures more important.
Modern programmers can choose data structures from optimized libraries such as
the Standard Template Library (STL) or Boost in C++, and the Java collections framework.
These libraries stress both performance and flexibility, but their performance is
often limited to an O(log n) algorithmic factor, and the dynamic nature of these structures
lessens the benefits of prefetching and caching. This thesis shows that HWDSs can
improve performance by reducing that algorithmic factor to O(1) for common operations
in ideal cases; when the ideal is not met, extra support from the OS helps to
maintain performance improvements.

The  following  examples  demonstrate  the  potential  for  improvement  from  data  structures 
implementing  the  two  abstract  data  types  considered  in  this  thesis,  the  priority  queue  and 
map:

• Planning algorithms. Two popular algorithms that use priority queues are Dijkstra’s
shortest-path algorithm and the A* planning algorithm. Experiments show that
Dijkstra’s algorithm often spends 50-60% of its execution time in the priority queue [81].
Our own experiments on real-world maps taken from the DIMACS shortest path
implementation challenge benchmarks [26] show the benchmark spends up to 29% of
its time inside the priority queue.

•  Image  analysis.  The  grey-weighted  distance  transform  on  3D  images  uses  a  software 
priority queue [82]. Measurements show the priority queue accounts for over 30% of
the application’s execution time; see Section 6.4.

• Discrete event simulation. A priority queue organizes pending events in a discrete-event
simulation (such as a queueing network or integrated circuit simulation), and
has been a popular test case for priority queue implementations [61, 105]. Such
simulations spend up to 40% of execution time managing the queue [105].

•  Fine-grained  multitasking.  Carbon  [73]  uses  hardware  queues  to  improve  fine-grained 
multitasking for Recognition, Mining, and Synthesis. Compared to software
approaches, Carbon can achieve 68% faster execution time for loop-level parallelism,
and  109%  for  task-level  parallelism. 

•  Real-time  task  scheduling.  In  prior  work,  I  have  shown  that  a  hardware  priority 
queue  reduces  scheduling  overheads  and  improves  predictability  [16];  others  have 
shown  that  a  hardware  priority  queue  can  reduce  task  scheduler  overhead  from  18% 
in software to less than 0.5% [72].

• Web browsers. The Chromium web browser makes extensive use of the C++ STL
map container, which often is implemented as a red-black tree. Profiling (see
Appendix A) of this code shows that, even for a short session of starting, loading a
blank page, and stopping, Chromium creates 1907 maps and executes 49,483 find
operations that consume 436,758,391 cycles of map execution time, or approximately
12% of overall execution time.

• Programming languages. Interpreted languages need to look up strings frequently, as
do systems that monitor memory accesses. For example, Akritidis et al. [7] use a
referent object checker built on a splay tree, a self-adjusting binary search tree (BST),
and evaluate it on the Olden and SPECINT 2000 benchmarks. For Olden the time overhead
of using the checker was 30% on average (excluding two benchmarks); for SPECINT
2000 the overhead was on average 900% and exceeded 100% for all benchmarks.

• OS search trees. Pfaff [100] evaluates implementations of BSTs, including random
BSTs, self-balancing BSTs (AVL and red-black), and splay trees, in the context of
systems usage. The systems applications used to evaluate the BSTs are virtual memory
address (VMA) mapping in Linux, IP peer caching, and index cross-reference
collation. With real-world data, a splay tree implementation of VMA mapping improves
performance of Mozilla, VMware, and Squid test sets by 23% to 40%. Other
uses of balanced search trees in Linux include: input/output (I/O) schedulers, optical
device driver, high-resolution timers, ext3 filesystem directory entries, and
cryptographic keys [1].

• Key-value stores. Key-value stores implement straightforward searching with keys
that are often either strings or integers. Search benchmarks model the application
processing of key-value stores; OS processing time of key-value stores can be
substantial: when requests are small, memcached spends up to 80% of its time in OS
code, primarily for network packet processing [20].

These  applications  are  just  a  sample  of  the  uses  for  priority  queues  and  maps.  OS  support 
for  HWDS  use  in  these  applications  can  eliminate  much  of  the  time  spent  processing  data 
structure  operations. 

1.2  Overview 

In using a data structure, an application “reads” (searches or iterates) and “writes” (inserts
or removes) nodes. A data structure’s read/write operations abstract the lower-level
load/store operations that comprise a processor’s interface to memory. By supporting the
high-level abstraction of data structure operations, HWDSs enable applications to extract
fine-grained parallelism from their data structures.

void bubble_up(int i) {
    while (i > 1 && heap[i]->key < heap[PARENT(i)]->key) {
        swap_entries(i, PARENT(i));
        i = PARENT(i);
    }
}

void heap_insert(int key, int val) {
    int s = ++heap_current_size;
    heap[s] = alloc_init_node(key, val);
    bubble_up(s);
}

(a) Insertion code for a software binary heap.

void pq_insert(int id, int val, int key) {
    HWDS_INSERT(id, key, val);
}

(b) Insertion code for a priority queue HWDS.

Figure 1-3: Program code changes when using a HWDS.


Figures 1-3a and 1-3b demonstrate the expressive power of a HWDS abstraction with
the insert code of a software priority queue implemented as a binary heap, and the insert
code of a priority queue using a HWDS, respectively.

Figure 1-4a shows how a HWDS can fit with other computer hardware in a uniprocessor
setting; multicore chips introduce complications for sharing and communication, and one
possible configuration is shown in Figure 1-4b. Design space exploration for both uni- and
multi-processing with HWDSs is interesting future work.


(a)  Computer  organization.  (b)  Multicore  computer  organization. 


Figure  1-4:  HWDS  architecture  overview. 


This  thesis  makes  it  possible  to  use  a  HWDS  even  when  the  application’s  data  needs 
exceed  the  HWDS  capacity,  or  when  multiple  data  structures  attempt  to  share  the  HWDS 
concurrently.  I  demonstrate  the  benefit  of  OS  support  for  HWDSs  with  use  cases  and 
synthetic  benchmarks  that  are  executed  using  cycle-accurate  simulation. 

1.2.1  Overflow  handling 

Generic  applications  require  support  for  data  structures  of  arbitrary  size.  Since  hardware 
has  a  fixed  capacity,  arbitrarily  large  data  sets  eventually  will  cause  overflow.  A  HWDS 
is  like  a  write-back  cache:  it  must  save  dirty  nodes  to  backing  storage  or  else  the  updated 
data  would  be  lost.  This  is  in  opposition  to  a  write-through  or  read-only  cache,  which  can 
handle  overflow  by  simply  removing  nodes  from  the  hardware  unit’s  storage  because  the 
backing  storage  already  contains  the  up-to-date  node’s  data. 

The specifics of overflow handling depend on the implementation of the HWDS, but
the general concept is universal. To deal with overflow, HWDS control logic and software
(for example, the OS) spill data out of the HWDS and into an overflow data structure in
secondary storage (main memory); see Figure 1-5a. Conversely, control logic and software
fill data from the overflow data structure when the HWDS needs to access nodes that
it previously spilled; see Figure 1-5b. Section 3.1 describes HWDS overflow handling in
greater detail.
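
The spill and fill steps described above can be sketched in software. The following C fragment is only an illustration under stated assumptions, not the mechanism of Section 3.1: a small sorted array stands in for the hardware priority queue, an unsorted array stands in for the overflow data structure in main memory, and all names (`pq_t`, `HW_CAPACITY`, `OV_CAPACITY`, `spill`, `fill`) are hypothetical. The sketch keeps the smallest keys resident, spills the largest resident key when an insert would exceed capacity, and fills from the overflow area when the resident set runs empty.

```c
#include <assert.h>
#include <stddef.h>

#define HW_CAPACITY 4   /* hypothetical hardware capacity, in nodes     */
#define OV_CAPACITY 64  /* hypothetical size of the overflow area       */

typedef struct {
    int    hw[HW_CAPACITY];  /* models the HWDS: keys in ascending order */
    size_t hw_count;
    int    ov[OV_CAPACITY];  /* models the overflow structure in memory  */
    size_t ov_count;
} pq_t;

void pq_init(pq_t *q) { q->hw_count = 0; q->ov_count = 0; }

/* Spill: move one key out of the "hardware" into the overflow structure. */
void spill(pq_t *q, int key) {
    assert(q->ov_count < OV_CAPACITY);
    q->ov[q->ov_count++] = key;
}

/* Fill: reload the empty "hardware" with the smallest spilled keys. */
void fill(pq_t *q) {
    while (q->hw_count < HW_CAPACITY && q->ov_count > 0) {
        size_t min = 0;
        for (size_t i = 1; i < q->ov_count; i++)
            if (q->ov[i] < q->ov[min]) min = i;
        q->hw[q->hw_count++] = q->ov[min];   /* appended in ascending order */
        q->ov[min] = q->ov[--q->ov_count];   /* remove from overflow area   */
    }
}

void pq_insert(pq_t *q, int key) {
    int constrained = (q->hw_count == HW_CAPACITY || q->ov_count > 0);
    if (constrained && q->hw_count > 0 && key >= q->hw[q->hw_count - 1]) {
        spill(q, key);   /* key is no smaller than any resident key */
        return;
    }
    if (q->hw_count == HW_CAPACITY)
        spill(q, q->hw[--q->hw_count]);      /* evict the largest resident key */
    size_t i = q->hw_count++;
    while (i > 0 && q->hw[i - 1] > key) {    /* sorted insert */
        q->hw[i] = q->hw[i - 1];
        i--;
    }
    q->hw[i] = key;
}

/* Returns 1 and stores the minimum key in *out, or 0 if the queue is empty. */
int pq_extract_min(pq_t *q, int *out) {
    if (q->hw_count == 0) return 0;
    *out = q->hw[0];
    for (size_t i = 1; i < q->hw_count; i++) q->hw[i - 1] = q->hw[i];
    if (--q->hw_count == 0) fill(q);         /* underflow: fill from memory */
    return 1;
}
```

In this sketch the application sees one priority queue of up to `HW_CAPACITY + OV_CAPACITY` keys, even though only `HW_CAPACITY` keys are resident at a time; the linear scan in `fill` is a simplification chosen for brevity.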

1.2.2  Sharing  HWDS  resources:  HWDS  assignment 

Multiple data structures might share a HWDS, for example when two applications execute
concurrently and use the hardware for different data structures. Sharing is a traditional OS
problem of how to manage contention for a limited hardware resource: the usual solution
is scheduling. This thesis turns the sharing problem into that of HWDS assignment, which
is the problem of determining whether a data structure uses a HWDS or a software-only
implementation. When two data structures do share a HWDS, the OS supports the HWDS
with a HWDS context switch: spilling the nodes for the current HWDS context and filling
nodes for the requested data structure; see Figure 1-6. Section 3.2 further illuminates the
problem of sharing HWDS resources and its solution, HWDS assignment.

(a) Spilling to handle overflow. (b) Filling to handle underflow.

Figure 1-5: Handling limited hardware capacity with HWDSs.

Figure 1-6: HWDS sharing with a HWDS context switch.
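
A HWDS context switch can be pictured as a spill of every resident node followed by a fill for the requested data structure. The C sketch below is a simplified software model under stated assumptions, not the implementation of Section 3.2: the types `hwds` and `ds_context`, the constants, and both functions are hypothetical, and node ordering inside the hardware is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define HW_CAPACITY 4   /* hypothetical hardware capacity, in nodes         */
#define MAX_NODES   16  /* hypothetical per-structure backing-store size    */

typedef struct {             /* software-side state for one data structure */
    int    backing[MAX_NODES];  /* nodes currently spilled to memory       */
    size_t backing_count;
} ds_context;

typedef struct {             /* models the shared hardware unit            */
    int         nodes[HW_CAPACITY];
    size_t      count;
    ds_context *owner;       /* context whose nodes are resident, or NULL  */
} hwds;

/* Spill the current owner's nodes, then fill nodes for the next context. */
void hwds_context_switch(hwds *hw, ds_context *next) {
    if (hw->owner == next) return;
    if (hw->owner != NULL) {
        while (hw->count > 0) {
            assert(hw->owner->backing_count < MAX_NODES);
            hw->owner->backing[hw->owner->backing_count++] =
                hw->nodes[--hw->count];
        }
    }
    hw->count = 0;
    while (hw->count < HW_CAPACITY && next->backing_count > 0)
        hw->nodes[hw->count++] = next->backing[--next->backing_count];
    hw->owner = next;
}

/* Insert on behalf of a context, switching and spilling as needed. */
void hwds_insert(hwds *hw, ds_context *ctx, int value) {
    hwds_context_switch(hw, ctx);
    if (hw->count == HW_CAPACITY) {          /* overflow: spill one node */
        assert(ctx->backing_count < MAX_NODES);
        ctx->backing[ctx->backing_count++] = hw->nodes[--hw->count];
    }
    hw->nodes[hw->count++] = value;
}
```

The cost of a switch in this model grows with the number of resident nodes, which suggests why deciding whether a data structure should use the HWDS at all (HWDS assignment) matters when several structures compete for it.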

1.3 Contributions


This  thesis  explores  the  hardware-software  interface  of  HWDSs  with  a  holistic  approach 
that  has  many  contributions  including: 

• Operation-level interface for applications to use HWDSs. An interface between
HWDSs and software gives applications access to HWDS resources and improves
program performance. The programming interface is at the level of data
structure operations, and the implementation is at the instruction set architecture
(ISA) level so that future improvements in the hardware microarchitecture do not
affect the interface.

• Effective use of parallelism compared to conventional architectures. Explicitly
parallel architectures require a programmer to partition and synchronize shared
data accesses. HWDSs use implicit parallelism to achieve high-performance parallel
computing without burdening the programmer with consistency and tasking models.
Implicit parallelism improves software performance at little cost to the programmer.

• Spilling HWDS overflow data. Hardware and software work together to support
large data structures that overflow hardware capacity. Although some performance
is lost, the HWDS approach remains competitive with software-only solutions. Compared
to software implementations, a 128-node HWDS achieves over 50% faster mean
access time to a 512-node priority queue, and 15% faster mean search time in a 512-node
read-mostly map.

•  HWDS  Assignment  for  sharing  a  HWDS.  HWDS  assignment  is  supported 
by  the  OS  to  share  and  restrict  available  HWDS  resources  among  multiple  data 
structures.  When  sharing  a  HWDS  among  four  maps  of  power-of-2  sizes  between  64 
and  512,  a  128-node  HWDS  achieves  35%  faster  searches  than  a  splay  tree.  Eviction 
of oversized HWDSs enables the OS to make dynamic assignment decisions to limit
performance  loss;  when  a  128-node  HWDS  is  used  for  a  512-node  map  that  is  updated 
and  searched,  an  eviction  policy  yields  16%  performance  loss,  but  performance  loss 
without  eviction  is  64%.  Prior  art  does  not  offer  any  solutions  for  HWDS  assignment, 
so  these  performance  improvements  are  made  possible  solely  by  the  OS  support  for 
HWDSs  proposed  in  this  thesis. 

• Support for many kinds of data structures. The priority queue and map are
examples of HWDSs that improve the performance of sorting and searching, two
fundamental  problems  in  computing.  The  policies  and  solutions  of  this  thesis  apply 
to  both  kinds  of  data  structures,  and  future  work  can  investigate  others  such  as 
string-based  or  hashing  structures. 

•  Increased  real-time  schedulability.  HWDSs  can  benefit  real-time  systems  by 
reducing  worst-case  execution  times  (WCETs)  even  when  multiple  data  structures 
share  a  HWDS  or  when  data  structure  sizes  exceed  HWDS  capacity. 

• Evaluation with cycle-accurate timing, real systems, and real-world data.
Real applications and synthetic benchmarks validate the HWDS approach using
cycle-accurate fully-functional simulation. OS support is designed and implemented
in the Real-Time Executive for Multiprocessor Systems (RTEMS) real-time operating
system (RTOS), so real OS overheads are included in the experiments. The
simulator executes HWDS operations and accounts for operation latency as part of
the cycle time. Experiments are conducted using applications and microbenchmarks
that use data structures with both software and HWDS implementations.
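
The schedulability contribution listed above can be made concrete with a textbook utilization test. The sketch below assumes independent, preemptible periodic tasks with deadlines equal to periods on a uniprocessor, for which earliest-deadline-first scheduling meets all deadlines exactly when total utilization does not exceed 1. The task parameters used with it are hypothetical and are not results from this thesis, and the analysis in Chapter 6 may use a different test; the point is only that lowering WCETs lowers utilization.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double wcet;    /* worst-case execution time C_i                 */
    double period;  /* period T_i (deadline assumed equal to period) */
} task;

/* Total utilization U = sum of C_i / T_i over the task set. */
double utilization(const task *ts, size_t n) {
    double u = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(ts[i].period > 0.0);
        u += ts[i].wcet / ts[i].period;
    }
    return u;
}

/* Utilization-based EDF test under the assumptions stated above. */
int edf_schedulable(const task *ts, size_t n) {
    return utilization(ts, n) <= 1.0;
}
```

For example, a hypothetical task set with (C, T) pairs (4, 10), (5, 15), and (6, 20) has utilization above 1 and fails the test, whereas the same periods with WCETs of 3, 5, and 5 pass it.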

With respect to prior art, an experiment using a GPS benchmark with real-world data
compares overflow handling in the exception-based united HWDS
proposed by this thesis against the interposition-based split HWDS proposed by others [27];
see Section 4.3.2. When using the united HWDS, the benchmark executes 24% faster
than  when  using  a  software  implementation,  even  though  the  data  structure  size  exceeds 
the  hardware’s  capacity.  The  benchmark  using  the  split  HWDS  never  does  better  than 
software  in  the  presence  of  overflow. 

The OS support for HWDSs presented in this thesis bears some resemblance to policies
and mechanisms for cache and translation lookaside buffer (TLB) management, but
the structural locality, operation diversity, and design and implementation multiplicity of
HWDSs demand new solutions. A memory cache reflects a flat array of storage and
leverages the independence between cache lines for fast, effective fetching and replacing. A
HWDS has connections between nodes that must be preserved, which would require complex
hardware to implement structure-preserving overflow. HWDSs support common data
structure operations that encode high-level abstractions in low-level mechanisms, whereas
caches and TLBs are limited to the load/store memory interface. Extant solutions to hardware
overflow and sharing that rely on hardware mediation are not useful across multiple
kinds of HWDSs, and hardware management for any given HWDS implementation would
drive up its cost and complexity in terms of both development and hardware resources.
The structural locality, operational richness, and design diversity motivate software management
of HWDSs. This thesis shows that software, which is more flexible, fixable, and
forward-compatible than hardware, can manage HWDSs efficiently to provide performance gains
for applications and systems software.

1.4  Scope 

Investigation  of  HWDSs  is  an  open-ended  area  of  research.  Limits  on  the  scope  of  this 
thesis  delineate  what  is  and  is  not  investigated. 

This thesis investigates: architectural support for HWDSs with Simics/GEMS, OS
support  with  RTEMS  in  a  uniprocessor  setting,  representative  data  structures  (priority 
queue  and  map)  and  applications,  HWDSs  in  real-time  systems,  and  the  performance  of 
HWDSs  versus  software-only  solutions. 

This thesis does not investigate: real hardware or general-purpose OS (e.g., Linux) implementations,
design space exploration for HWDS interfaces or implementations, compiler
support for HWDSs, sharing a HWDS among multiple tasks with a single data structure,
OS optimizations that use the knowledge about applications gleaned from HWDS behavior,
multiprocessor architectures, and metrics related to power, reliability, or usability. All
of these areas are possible directions for future work.

1.5  Outline 

This thesis is organized as follows. Chapter 2 reviews the related work in the field. Chapter 3
describes the generic OS support for HWDSs necessary for overflow handling and
HWDS assignment. Chapter 4 describes an example of a HWDS that implements a priority
queue, refines the generic overflow handling support, and presents experimental results
that demonstrate the performance of overflow handling and HWDS assignment for
two important priority queue applications: discrete event simulation and path planning.
Chapter 5 proposes a HWDS implementation of a map for efficient searching, and presents
experimental results from a synthetic search benchmark. Chapter 6 shows how real-time
systems can use HWDSs to improve the schedulability of task sets by reducing WCETs;
I evaluate four HWDS assignment algorithms using experiments and benchmarks modeled
from real-world applications. Chapter 7 identifies possibilities for future work and
concludes.

Chapter 2 — Literature Review


The work most closely related to this thesis is in the areas of design of HWDSs, hardware
support  for  fine-grained  parallelism,  shipping  code  to  data,  linked  prefetching,  object-based 
systems,  and  transactional  memory.  The  following  reviews  each  of  these  in  turn. 

2.1  Design  and  Implementation  of  HWDSs 

2.1.1  HWDSs  for  network  routing 

Hardware support for scheduling has been an area of interest in the queuing hardware of
packet-switched networks. Moon et al. [87] compare four approaches to hardware priority
queues for high-speed networks and introduce an approach that melds two of the previous
solutions. Kim and Shin [65] describe an architecture for earliest deadline first (EDF)
scheduling for asynchronous transfer mode (ATM) switch networks and introduce deadline
folding to circumvent limitations in the range of
priority values. Bhagwan and Lin [14] introduce a heap-based hardware priority queue
with pipelined stages of the enqueue and dequeue operations. Morton et al. [89] describe
a hardware priority queue that does not require hardware comparators.

How  this  thesis  differs  Although  packet-switched  routers  can  benefit  from  hardware 
priority  queues,  software  has  no  interface  to  access  the  priority  queues — they  are  only 
useful  for  sorting  network  packets.  This  study  enables  software  to  use  the  priority  queues 
by  exposing  an  interface  to  the  hardware  so  that  software  can  benefit  from  the  hardware 
acceleration  while  remaining  flexible  to  implement  different  algorithms  using  functional 
memory. 

2.1.2 HWDSs for real-time scheduling


Approaches for hardware-based packet scheduling have been extended for task scheduling in
RTOSs. The goals of hardware support for real-time scheduling are to minimize scheduling
latency and provide highly predictable multiprocessing. The Spring Scheduling Coprocessor
(SSCoP) [24] is one of the first examples of a hardware task scheduler and introduces
simple queues for the set of scheduled tasks. Others have implemented hardware scheduling
using some form of custom logic and a hardware priority queue [108, 71, 69, 16, 72, 115].

How this thesis differs In contrast to the prior work, which focuses on hardware support
for a single fixed-size priority queue, this thesis allows arbitrarily large priority queues
to share a hardware priority queue.

2.1.3  HWDSs  for  reconfigurable  computing  with  Java 

Chandra and Sinnen [27] investigate HWDSs in the context of integrating a high-level
language, Java, with reconfigurable computing. In addition to the usual priority queue operations,
the authors investigate how to increase the queue length, use non-integer priority
values, and add new operations.

How  this  thesis  differs  Chandra  and  Sinnen  do  not  consider  how  HWDSs  are  shared 
and  scheduled  among  multiple  consumers.  Their  approach,  a  split  interposition-based 
HWDS,  does  not  handle  overflow  well;  see  Section  4.3.2. 

2.1.4  Systolic  Priority  Queues 

Leiserson  [77]  describes  systolic  HWDS  implementations  including  priority  queue,  multi¬ 
queue,  and  tree.  He  suggests  that  overflow  be  handled  by  the  OS,  and  that  pairing  an 
insert  with  an  extract  can  handle  refilling  the  HWDS. 


How this thesis differs Leiserson focuses on the hardware design of systolic HWDSs
with only cursory examination given to the software side of the HWDS-OS equation. This
thesis demonstrates that intelligent software support is necessary to achieve good performance
from HWDSs in the presence of overflow and sharing.

2.1.5  Abstract  Datatype  Processors 

Kim  [67]  and  Wu  et  al.  [134]  share  the  vision  of  raising  the  abstraction  of  hardware  to 
that  of  software;  their  work  proposes  and  evaluates  abstract  datatype  processors,  which 
accelerate  data  types  with  mechanisms  and  performance  similar  to  HWDSs.  Abstract 
datatype  instructions  can  reduce  instruction  fetch  times  by  21-48%  and  data  read/write 
times  by  22-40%.  The  datatypes  they  investigated  are  the  sparse  vector  and  hash  table, 
and  hardware  support  is  modeled  with  a  content-addressable  memory  (CAM). 

How  this  thesis  differs  Abstract  datatype  instructions  currently  ignore  capacity  and 
sharing  problems,  but  the  similarity  between  these  instructions  and  HWDSs  indicates 
similar  problems  exist  due  to  hardware  size  limitations. 

2.1.6  Content-addressable  memory  (CAM) 

Hardware  can  search  small  sets  of  records  with  numerical  keys  efficiently  with  a  CAM. 
Ternary  CAMs  [97]  can  implement  approximate  search  for  some  applications,  such  as 
longest  prefix  matching. 
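
Longest prefix matching with a ternary CAM can be sketched as a value/mask comparison. The C fragment below is a serial software stand-in for what the hardware does across all entries in parallel; the type `tcam_entry`, the functions, and the use of -1 as a miss indicator are hypothetical choices made for illustration. Choosing the match with the numerically largest mask is valid here only because the masks are assumed to be contiguous prefixes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t value;  /* bit pattern to match                        */
    uint32_t mask;   /* 1 = bit must match, 0 = "don't care"        */
    int      data;   /* payload (assumed non-negative), e.g. a port */
} tcam_entry;

/* Mask selecting the top `len` bits of a 32-bit key (len in 0..32). */
uint32_t prefix_mask(unsigned len) {
    assert(len <= 32);
    return len == 0 ? 0u : (uint32_t)(0xFFFFFFFFu << (32u - len));
}

/* Return the payload of the matching entry with the longest prefix,
 * or -1 if no entry matches. */
int tcam_lookup(const tcam_entry *t, size_t n, uint32_t key) {
    int      best = -1;
    uint32_t best_mask = 0;
    int      found = 0;
    for (size_t i = 0; i < n; i++) {
        if ((key & t[i].mask) == (t[i].value & t[i].mask)) {
            if (!found || t[i].mask > best_mask) {
                best = t[i].data;
                best_mask = t[i].mask;
                found = 1;
            }
        }
    }
    return best;
}
```

A table holding the prefixes 10.0.0.0/8, 10.1.0.0/16, and 0.0.0.0/0 would resolve the key 10.1.2.3 to the /16 entry, since it is the longest of the three matching prefixes.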

A  common  use  for  CAM  in  modern  computing  is  as  a  read-only  cache  for  page  tables — 
the  virtual-to-physical  address  translation  map  that  underlies  page-based  virtual  memory 
systems.  This  cache  is  called  the  TLB,  and  its  purpose  is  to  cache  translations  for  fast 
lookup.  Tagged  TLBs  permit  cached  entries  from  multiple  page  tables  to  share  the  TLB. 
TLB  overflow  is  handled  by  dropping  entries;  since  the  TLB  is  a  read-only  cache,  the 
backing data remains in memory. However, if the page table is modified, the TLB needs to
be  refreshed  or  invalidated. 

How  this  thesis  differs  CAMs  do  not  permit  searching  with  arbitrary-sized  or  multiple 
data  sets  because  of  limited  hardware  capacity,  but  the  solutions  posed  in  this  thesis  may 
be  used  with  CAMs  to  implement  a  map  HWDS. 

The  primary  difference  between  the  page  table-TLB  and  the  HWDSs  employed  in  this 
thesis  is  that  the  TLB  acts  as  a  read-only  cache  for  the  page  table,  whereas  this  thesis 
uses HWDSs like a write-back cache for the overflow data structure. Although subtle,
this  difference  is  important.  Other  differences  include:  a  TLB  does  not  export  a  search 
function;  a  task  or  process  only  gets  to  use  one  page  table  at  a  time;  TLBs  do  not  in 
general  support  arbitrary  search  keys — the  address  translation  relies  on  the  size  of  pages 
in  the  page  table  to  divide  the  search  space. 

2.1.7  Scratchpad  memory  (SPM) 

An  alternative  to  caching  in  the  embedded  domain  is  a  scratchpad  memory  (SPM)  [101]. 
SPMs  can  provide  predictable  access  times  and  software  control  over  code  [133]  and 
data  [125].  SPMs  are  software-managed:  applications  and  compilers  control  the  data  (and 
code)  residing  in  the  SPM.  Co-mingling  SPMs  with  custom  hardware  can  provide  further 
benefits  such  as  intelligent  object-based  allocation  [129,  128]. 

How this thesis differs Software that uses a SPM still executes serially to access data
structures. HWDSs execute in parallel and require different management than scratchpads
because of the increased hardware complexity in HWDS logic. Combining the two
approaches to use a HWDS with a SPM as the backing store may be useful for overflow
handling.

2.1.8 Reconfigurable computing data structures

A seminal paper in reconfigurable computing (RC) design by DeHon et al. [40] proposes
classes of design patterns for RC. One of these design pattern classes is the Value-Added
Memory Patterns, which includes CAMs, priority queues, and other data structures. Some
of the other data structures implemented in RC logic include graphs [85, 41] and trees [117].
These  data  structure  implementations  can  be  reused  most  easily  in  a  HWDS  framework 
that  executes  as  a  co-processor. 

How  this  thesis  differs  Existing  RC  data  structures  do  not  support  general  applications 
because  sharing  and  overflow  are  not  addressed. 

2.1.9  String  matching 

Modern applications increasingly rely on text processing (for example, parsing web documents,
string search, and regular expression matching) that benefits from hardware support
for string matching [25]; so do network appliances for deep packet inspection in intrusion
detection [30, 118, 63, 139, 60].

How  this  thesis  differs  String  and  regular  expression  matching  architectures  implement 
HWDSs  for  specialized  string-based  applications.  Future  work  can  make  use  of  these 
HWDSs  and  improve  their  generality  across  application  domains  by  employing  the  results 
of  this  thesis. 

2.2  Fine-grained  Parallelism 

If  a  programmer  decomposes  a  program  into  small  independent  tasks  then  the  program  has 
more  potential  parallelism  and,  by  Amdahl’s  law  [10],  greater  speedup.  Thus,  multicore 
platforms should support fine-grained task-level parallelism (or thread-level parallelism)
(TLP)  for  greater  speedup.  The  challenge  for  fine-grained  TLP  is  to  maintain  overheads 
proportional  to  task  sizes  and  to  avoid  solutions  that  degrade  performance,  for  example  by 
destroying  locality.  Improving  TLP  performance  for  current  and  next  generation  processors 
shares  common  ground  with  HWDSs,  both  in  motivation  and  solution  methods. 
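
The appeal to Amdahl's law above can be made concrete. With p the fraction of execution time that can run in parallel and n the number of processors, the speedup is S(n) = 1 / ((1 - p) + p / n). The short C function below evaluates this formula; the function name and parameter names are chosen here for illustration only.

```c
#include <assert.h>

/* Amdahl's law: speedup on n processors when a fraction p (0..1) of the
 * execution time is parallelizable and the rest remains serial. */
double amdahl_speedup(double p, unsigned n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For instance, with p = 0.9 and n = 10 the speedup is 1 / 0.19, roughly 5.26, and no number of processors can push it past 1 / (1 - p) = 10. Finer-grained decomposition helps only to the extent that it raises p without adding overhead.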

2.2.1  Carbon 

Kumar  et  al.  introduce  Carbon  [73],  hardware  acceleration  for  multicore  task  scheduling 
with  task  last-in,  first-outs  (LIFOs),  prefetchers,  and  work  stealing  in  hardware  to  support 
fine-grained  TLP.  Carbon  exposes  a  task  queue  application  programming  interface  (API) 
in the form of instruction set architecture (ISA) extensions, so it is similar to the HWDS
paradigm. 

How this thesis differs In Carbon, the queues are used specifically for task scheduling,
which means that applications only benefit if Carbon extracts sufficient fine-grained
TLP. Carbon provides no benefit to serial workloads and requires small task sizes to see
improvement over software scheduling. A HWDS configured as a LIFO would be similar
to the single core configuration of Carbon.

2.2.2  Ne-XVP 

A research project at NXP Semiconductors (formerly Philips Semiconductors), the Ne-XVP
architecture aims to provide an efficient multimedia processor platform. Three specific
aspects of the Ne-XVP are relevant to this study: the Task Scheduling Unit [56], Task
Management Unit [113], and Hardware Task Scheduler [8]. Unlike in Carbon and
HWDSs, the hardware queues in Ne-XVP are not exposed at an API or ISA level.

How this thesis differs As in Carbon, the scheduling policy is inflexible and programs
that lack TLP cannot benefit from the extra hardware support. Our project allows programs
to improve serial performance bottlenecks by taking advantage of parallelism in data
structures.

2.2.3  Asynchronous  Direct  Messages 

Sanchez  et  al.  [109]  introduce  asynchronous  direct  messages  (ADM)  to  provide  message 
passing  akin  to  interprocessor  interrupts  but  avoiding  the  cache  hierarchy.  The  authors 
implement  work-stealing  scheduling  algorithms  for  multicore  platforms  in  the  context  of 
fine-grained  parallel  workloads  using  ADM.  Task  queues  are  maintained  in  software,  so 
that  ADM  is  the  only  hardware  component  of  the  task  scheduler.  New  privileged  software 
handles  the  receive  buffer  overflow  and  underflow  conditions.  Privileged  software  also  is 
responsible  for  mapping  each  scheduled  task  to  a  specific  core  for  translating  destination 
task  IDs  when  routing  messages. 

How this thesis differs Asynchronous direct messages attack the communication bottlenecks
between tasks in a multicore platform, whereas this thesis focuses on the bottleneck
of serial memory accesses during data structure operations.

2.2.4  HAQu 

Lee  et  al.  [76]  propose  a  hardware  accelerated  queue  (HAQu,  pronounced  “haiku”)  that 
accelerates  software  queues  for  multicore  platforms.  Unlike  the  work  reviewed  so  far,  HAQu 
does  not  use  a  hardware  queue;  instead  HAQu  implements  queuing  through  an  application’s 
address  space.  Hardware  buffers  queue  operations  through  a  unit  that  complements  each 
core’s  pipeline. 

How this thesis differs Implementing fast data structures through an application’s
address  space  is  an  interesting  idea,  but  the  restrictions  on  use  (single  producer-consumer 
pairs,  memory  fences)  may  be  trouble  for  complex  data  structures.  Because  HAQu  does  not 
leverage  hardware  parallelism,  it  cannot  achieve  the  speedup  possible  with  a  true  HWDS. 
Future  work  may  consider  how  HWDS  can  provide  isolation  of  producers  and  consumers 
while  maintaining  memory  consistency  in  hardware  and  virtualization  using  the  address 
space  write-through  proposed  by  HAQu. 
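
The restrictions mentioned above (a single producer paired with a single consumer, and explicit memory ordering) are characteristic of software queues that live in an application's address space. The C11 sketch below shows such a queue as a ring buffer with acquire/release ordering. It is a generic illustration of the kind of software queue involved, not a reproduction of HAQu's hardware mechanism, and every name in it (`spsc_queue`, `QUEUE_SLOTS`, and the functions) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_SLOTS 8  /* hypothetical size; must be a power of two */
static_assert((QUEUE_SLOTS & (QUEUE_SLOTS - 1)) == 0,
              "QUEUE_SLOTS must be a power of two");

typedef struct {
    int            buf[QUEUE_SLOTS];
    _Atomic size_t head;  /* next slot to read; written only by the consumer  */
    _Atomic size_t tail;  /* next slot to write; written only by the producer */
} spsc_queue;

void spsc_init(spsc_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns 1 on success, 0 if the queue is full. */
int spsc_enqueue(spsc_queue *q, int value) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_SLOTS) return 0;
    q->buf[tail % QUEUE_SLOTS] = value;
    /* Release ordering publishes the slot contents before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 and stores the value in *out, or 0 if empty. */
int spsc_dequeue(spsc_queue *q, int *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return 0;
    *out = q->buf[head % QUEUE_SLOTS];
    /* Release ordering frees the slot only after its value has been read. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 1;
}
```

The scheme is correct only while exactly one thread calls `spsc_enqueue` and exactly one calls `spsc_dequeue`; a data structure with several writers, or with links between nodes, would need a different protocol, which is the difficulty noted above.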

2.2.5  Loop  accelerators 

An  approach  for  exploiting  certain  kinds  of  loop-level  parallelism  is  a  loop  accelerator. 
Loop  accelerators  typically  sit  on  the  system  bus  and  directly  access  memory.  Often  they 
are  customized  for  a  particular  loop  or  a  limited  range  of  loop  bodies.  Loop  accelerators 
excel  over  general  purpose  processors  by  exploiting  loops  with  simple  control  flow,  cyclically 
repeated instruction streams, decoupled memory accesses and computations, and domain-aware
customizations of the processing units (functional units, interconnect, register files)
[33]. The same reasons that loop accelerators are advantageous to use prevent them from
being  useful  for  complex  or  linked  data  structures.  Branches  within  iteration  cannot  be 
speculated  easily  within  a  loop  accelerator,  so  structures  having  branch  points  such  as  trees 
will  not  be  supported. 

How this thesis differs Linked structures are hard to accelerate in a loop accelerator
because the address generation hardware is unable to use simple computations to fetch
the required memory for a loop body. Random access also implies data-dependent address
calculations, so certain array-like structures are not suitable for loop accelerators.
These restrictions prevent comparison of loop accelerators with HWDSs because the two
approaches target distinct workloads. Future work can consider approaches that combine
loop acceleration with HWDS support for linked data structures.
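
The distinction drawn above between simple and data-dependent address generation can be seen in three small loops. The C functions below are illustrative only and their names are hypothetical: the first has addresses that follow directly from the loop index, while the second and third cannot compute the next address until earlier data has been loaded.

```c
#include <assert.h>
#include <stddef.h>

/* Affine access: the address of a[i] is base + i * sizeof(int), so it can
 * be generated from the loop index without reading any data. */
long sum_array(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += a[i];
    return sum;
}

typedef struct node {
    int          value;
    struct node *next;
} node;

/* Pointer chasing: the next address is known only after the current node
 * has been loaded from memory. */
long sum_list(const node *head) {
    long sum = 0;
    for (const node *p = head; p != NULL; p = p->next) sum += p->value;
    return sum;
}

/* Data-dependent indexing: an array-like structure whose addresses depend
 * on loaded index values. */
long sum_indirect(const int *a, const size_t *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += a[idx[i]];
    return sum;
}
```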


2.2.6  Scalable  Cores 

Hill and Marty [55] argue that architecture research should pursue methods that provide
the ability to combine cores dynamically to boost the performance of sequential code;
Gibson calls such processors scalable cores [51]. CoreFusion [58], TRIPS [110], Composable
Lightweight Processors [66], WiDGET [127], and ForwardFlow [52] are scalable core architectures.
Scalable cores adapt dynamically to the needs of software so that TLP is exploited
when sufficient parallelism exists, while sequential workloads benefit from aggregations of
execution units.

How this thesis differs Scalable cores take an execution-oriented view toward performance
and choose between offering instruction-level parallelism (ILP) or TLP. Like scalable
cores, this thesis improves
the performance of (data structure) code that is hard to parallelize at a task granularity;
the difference is that a data-oriented view provides speedup to workloads that may not
benefit from either ILP or TLP because the primary bottleneck is memory.

2.3  Shipping  Code  to  Data 

2.3.1  Data  structure  co-processing 

Loew et al. [81] introduce data structure co-processing as a hardware-software approach
for accelerating data structure operations. This approach is a model of computation that
offloads data structure operations to a separate hardware thread or core. The main drawback
of the model is that the offloading suffers poor performance due to synchronization
and communication between application and data structure threads.

How this thesis differs This thesis couples HWDSs with OS support for applications.
HWDSs could improve data structure co-processing through core specialization.

2.3.2  Processor-in-memory 

Shrinking memory bandwidth with respect to processor speed motivates intelligent memory
(processor-in-memory), for example the IRAM project [98] and Active Pages [95]. An
intelligent  memory  architecture  embeds  some  processing  units  close  to  memory,  that  is,  on 
the  same  chip  as  the  memory  modules.  The  processing  units  enable  computations  that 
can  use  memory  at  a  higher  bandwidth  than  a  traditional  CPU  over  a  memory  bus.  Other 
processor-in-memory  projects  include  [98,  44,  59,  47,  70,  21,  31,  119,  46,  140].  OS  support 
such as that of ActiveOS [94] for Active Pages enables intelligent memory for multiprocess
environments. 

How  this  thesis  differs  HWDSs  differ  from  intelligent  memory  by  taking  advantage 
of  parallelism  within  structured  data;  the  two  approaches  could  be  used  together  with 
a  HWDS  implementing  an  intelligent  memory  processing  unit.  This  study  in  particular 
focuses  on  the  OS  policies  and  support  needed  to  make  HWDSs  work  with  general-purpose 
applications. 

2.3.3  Processor-in-disk 

Disk I/O suffers from latency problems similar to those of memory, and improvements in disk I/O
performance would benefit applications such as databases, web transaction processing, data
mining, and multimedia. Early work in database processors [114, 96, 79, 111] reduces the
costs of relational database operations by tailoring circuits to access data independently
from main processors. Database processors were abandoned due to inflexibility and problems
with backward compatibility [19], but active disks and related approaches [6, 104, 64]
generalize database processors to improve general-purpose disk I/O by shifting general-purpose
processors into disk controller interfaces. The notion of shifting processing code to
disks leads to semantically-smart disks that integrate disk I/O with knowledgeable filesystems
and applications [112].

How  this  thesis  differs  As  with  the  processor-in-memory  work,  this  thesis  focuses  on 
the OS policies and support for sharing and handling overflow. Furthermore, the implication
of intelligent disks is that either applications provide disk processing code, or disk
devices  are  application-aware.  With  HWDSs,  the  abstraction  of  a  data  structure  precludes 
such  tight  integration  between  hardware  and  software. 

2.4  Linked  Prefetching 

Prefetching is a known commodity in modern computer architecture. But just as well
known is that prefetching works well in structures that exhibit high spatial locality, with
iteration through dense arrays being the best case. For non-local accesses, such as those seen
in linked data structures, traditional prefetchers can actually degrade performance due to
unnecessary fetches. Prefetching of linked data structures is a challenging research area
with interesting solutions, including correlation-based prefetching [62], pointer prefetching
[107, 126, 23], content-directed prefetching [36, 45], and push prefetching [137, 138].
A novel solution also combines linked prefetching with intelligent memory in which a programmable
unit traverses data structures in memory and feeds the processor with prefetch
data [57].

How this thesis differs Unlike linked prefetchers, the HWDS implicitly knows the structure
of data so there is no need for logic to look ahead and fetch from memory. Linked
prefetchers offer one side of a coin (reduce average memory access times for linked data
structures) with a HWDS on the other side of the coin: reduce data structure processing
times through fine-grained parallelism. Combining the two approaches would be interesting;
perhaps a linked prefetcher could implement the overflow/underflow handler of a
HWDS independently of software.

2.5  Capability-  and  Object-based  Systems 

From the mid-1970s through the late 1980s, computer architects sought to support capabilities
[42, 78] and object representations directly in hardware [135, 34, 12, 90, 38]. An
infamous commercial system is the Intel iAPX 432, which featured capabilities, object
addressing, garbage collection, interprocess communication, multitasking, and multiprocessing
[93, 37, 78]; the iAPX 432 design failed due to performance problems [35].

How  this  thesis  differs  Every  language  can  implement  an  object  representation,  so 
direct  hardware  support  for  objects  is  inflexible  and  non-portable,  and  OS  modifications 
are  intrusive — especially  for  hardware  capabilities — and  complex.  HWDSs  implement  an 
abstraction  that  permits  simple  hardware  and  modular  OS  support.  Future  work  can 
extend  this  thesis  to  show  how  to  support  objects  in  a  manner  consistent  with  HWDSs. 

2.6  Transactional  Memory 

Multicore  places  extra  pressure  on  memory:  programs  share  data  and  execute  in  parallel 
contending  for  shared  memory.  Traditional  solutions  for  contention — namely  locking — 
may  not  scale  well  to  multicore  systems.  An  alternative  solution  is  transactional  memory 
(TM)  [54]— in  the  spirit  of  database  transactions — that  provides  an  all-or-none  atomicity 
for  memory  accesses. 

Hardware support for TM alleviates performance concerns, and both hardware-only
and hybrid hardware/software TM systems have been proposed and produced [88, 84, 18,
43, 39]. One of the challenges with TM systems is integration with the OS, for example
how to use transactions in the presence of system calls and I/O [106, 124].

How  this  thesis  differs  HWDSs  provide  benefits  to  serial  code  through  fine-grained 
parallelism  within  data  structure  operations,  an  advantage  that  TM  cannot  produce;  TM 
relies  on  the  availability  of  task  parallelism  and  multicore  processing.  Future  work  can 
investigate  how  HWDSs  in  a  multicore  platform  can  provide  HWDS-mediated  sharing  and 
compare  that  with  TM. 

2.7  Summary  of  Related  Work 

None  of  the  prior  art  approaches  the  problem  of  memory  limiting  system  performance  from 
a  HWDS  point-of-view.  Implementation  similarities  between  HWDSs  and  other  systems 
abound, and I have reviewed those which are most similar. This thesis shows that OS support
elevates HWDSs to improve applications in multitasking environments. The related
work suggests other areas that HWDSs may benefit, such as reconfigurable computing or
work-offloading  in  a  multicore  platform. 


Chapter 3 — OS Support for HWDSs: Generalities


Software  support  can  help  circumvent  the  size  and  sharing  limitations  of  hardware  so  that 
applications  can  benefit  from  HWDSs.  This  chapter  describes  the  generic  aspects  of  such 
support,  and  subsequent  chapters  describe  aspects  that  are  specific  to  the  priority  queue 
and  map  HWDSs. 

3.1  Overflow  Handling 

This study, inspired by work in fine-grained task-level parallelism [73, 109], adopts an
exception-based HWDS approach, as opposed to an interposition-based HWDS [27], which
avoids exceptions by checking (with software) before every HWDS access. The HWDS
generates  an  overflow  exception  when  the  size  of  the  data  structure  exceeds  the  capacity 
of  the  hardware.  An  overflow  exception  handler  then  processes  the  exception  by  spilling 
nodes  from  the  HWDS.  When  the  used  capacity  of  the  HWDS  falls  below  a  programmable 
threshold — and  there  are  spilled  nodes — control  logic  raises  an  underflow  exception.  The 
underflow  handler  fills  the  HWDS  from  the  overflow  data  structure. 

Spilling causes a problem for operations that target spilled nodes: software must implement
the operation on the nodes in the spill area. When an operation fails while using an
exception-based  HWDS,  control  logic  raises  a  failover  exception  and  the  exception  handler 
emulates  the  operation  on  the  nodes  in  the  overflow  data  structure.  (Interposition-based 
approaches  must  determine  whether  an  operation  should  target  the  spilled  nodes  or  the 
HWDS.) 

Overflow  handling  introduces  the  following  HWDS  instructions: 

• get-context: queries the context of the HWDS to determine the cause of an exception
and the interrupted instruction.

• spill: moves a node from the HWDS to backing storage.

• fill: moves a node from backing storage to the HWDS.

A  HWDS  implements  spill  with  the  data  structure’s  delete  operation,  and  fill  as  an 
insert  operation,  where  the  node  to  delete  or  insert  is  chosen  according  to  a  HWDS-specific 
policy. Sections 4.2 and 5.2.2 describe these policies for hardware priority queues and
hardware maps, respectively. get-context can be implemented in a control unit alongside
the HWDS that stores the most recent HWDS operation and its arguments.
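
The exception flow described above can be sketched in software. The following Python model is only an illustration under stated assumptions: the class name `ToyHWDS`, the parameters `capacity`, `threshold`, and `spill_count`, and the spill policy (evict the lowest-priority node, with smaller keys meaning higher priority) are all hypothetical and are not the hardware interface of this thesis. Ordinary method calls stand in for the overflow, underflow, and failover exceptions.

```python
import bisect


class ToyHWDS:
    """Software stand-in for a fixed-capacity hardware priority queue.

    Smaller keys mean higher priority. All names here are hypothetical.
    """

    def __init__(self, capacity, threshold, spill_count):
        # Keep enough room that a fill after underflow cannot overflow.
        assert 1 <= spill_count and threshold + spill_count <= capacity
        self.capacity = capacity        # nodes the "hardware" can hold
        self.threshold = threshold      # underflow trigger level
        self.spill_count = spill_count  # nodes moved per exception
        self.hw = []                    # sorted keys held in "hardware"
        self.backing = []               # sorted keys in the overflow structure
        self.exceptions = {"overflow": 0, "underflow": 0, "failover": 0}

    def spill(self):
        """Move one node (here, the lowest-priority one) to backing storage."""
        bisect.insort(self.backing, self.hw.pop())

    def fill(self):
        """Move one node (here, the highest-priority spilled one) back."""
        bisect.insort(self.hw, self.backing.pop(0))

    def enqueue(self, key):
        if self.backing and key >= self.backing[0]:
            # Failover: the operation targets the spilled region, so the
            # handler emulates it on the overflow data structure.
            self.exceptions["failover"] += 1
            bisect.insort(self.backing, key)
        elif len(self.hw) == self.capacity:
            # Overflow: spill a batch, then retry the interrupted operation.
            self.exceptions["overflow"] += 1
            for _ in range(self.spill_count):
                self.spill()
            self.enqueue(key)
        else:
            bisect.insort(self.hw, key)

    def dequeue(self):
        key = self.hw.pop(0)
        if self.backing and len(self.hw) < self.threshold:
            # Underflow: refill the hardware from the overflow structure.
            self.exceptions["underflow"] += 1
            for _ in range(min(self.spill_count, len(self.backing))):
                self.fill()
        return key
```

The caller never checks the capacity; it only calls `enqueue` and `dequeue`, which is the sense in which exceptions let software remain oblivious to the HWDS size. The `exceptions` counters give a rough view of how often the handlers would run for a given workload.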

Exceptions allow software to be oblivious to the HWDS capacity, but they induce overhead
that reduces the throughput and predictability of applications. The cost imposed by
overflow handling depends on the implementation of the overflow data structure, the frequency
of overflow/underflow/failover, and the cost of executing the exception handler. Experimental
evaluations in subsequent chapters of this thesis quantify the costs of overflow
handling.

3.2  HWDS  Assignment 

Sharing  the  HWDSs  adds  complexity  to  both  hardware  and  support  software.  The  main 
addition  is  that  the  hardware  needs  to  distinguish  data  structures;  in  prior  work,  there 
was  a  one-to-one  mapping  between  data  structure  and  HWDS.  Loosening  that  mapping 
to  many-to-one  introduces  the  problem  that  the  HWDS  must  have  some  way  to  separate 
or  distinguish  data  structures  and  their  operations.  As  with  other  facets  of  HWDS  design, 
more  than  one  solution  exists  for  this  problem.  The  solution  I  adopt  is  to  add  an  identifier 
to  every  instruction  that  accesses  the  HWDS  and  for  the  hardware  to  track  which  data 
structure  currently  is  in  use.  Exception  handlers  use  the  identifiers  to  store  overflow  in 
separate data structures. I chose this approach because the hardware cost is small (an
extra register and some comparators) and it supports a wide range of policies for how
HWDSs are shared. The main drawback is that each data structure must have a unique
identifier.

HWDS  assignment  introduces  the  following  instructions: 

•  save-context:  save  the  data  from  a  HWDS  to  backing  storage  and  make  that  HWDS 
available  for  use 

•  restore-context:  assign  the  HWDS  and  (optionally)  restore  data 

A HWDS context switch is a save-context followed by a restore-context. As the utilized
capacity of a HWDS increases, the cost to save-context also goes up. In Section 6.4,
I  evaluate  assignment  algorithms  that  can  limit  the  usable  size  of  a  HWDS  in  order  to  limit 
the  cost  of  the  context  switch. 

HWDS assignment can be solved statically or dynamically. Static assignment determines
offline which data structures are assigned to use HWDS resources, and at runtime a
HWDS context switch swaps one assigned data structure for another. Dynamic assignment
permits the OS to make assignment decisions online. Some mechanisms for dynamic assignment
are (1) permitting data structure operations to proceed without hardware support
(assignment to a software implementation), (2) saving the context of the currently in-use
HWDS and restoring the context of the requested data structure, or (3) suspending the task
making the request until a HWDS becomes available. With (1), every data structure operation
raises an exception that emulates the operation in software, a prohibitively expensive
solution. (Interposition-based approaches can implement (1) without such expense.) Mechanism
(2) has the drawback that it can lead to a problem analogous to thrashing; in the
worst case, every access to a HWDS could cause a save-context. A concern with (3) is
starvation. This thesis presents and evaluates static assignment algorithms and dynamic
assignment using mechanisms (1) and (2); future work can evaluate policies and algorithms
for HWDS assignment more thoroughly.

If  software  attempts  to  access  a  data  structure  that  is  not  presently  in  the  HWDS 
context,  then  control  logic  triggers  an  exception.  The  OS  can  assign  the  data  structure  to 
an  available  HWDS,  save  the  context  of  a  currently  used  HWDS  and  assign  it  to  the  data 
structure,  or  assign  the  data  structure  to  use  software  only. 
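
As a rough sketch of identifier-based sharing with dynamic assignment mechanism (2), the following Python model tracks which data structure currently occupies a single HWDS and performs a context switch when an access names a different one. The class `SharedHWDS`, its fields, and the `access` method are hypothetical names chosen for illustration; a Python list managed by `heapq` stands in for the hardware contents.

```python
import heapq


class SharedHWDS:
    """Toy model of one HWDS shared by several data structures."""

    def __init__(self):
        self.current_id = None      # identifier of the structure in hardware
        self.nodes = []             # contents of the "hardware"
        self.saved = {}             # backing storage, keyed by identifier
        self.context_switches = 0

    def save_context(self):
        """Save the hardware contents and make the HWDS available."""
        self.saved[self.current_id] = self.nodes
        self.nodes = []
        self.current_id = None

    def restore_context(self, ds_id):
        """Assign the HWDS to ds_id and restore any saved data."""
        self.current_id = ds_id
        self.nodes = self.saved.pop(ds_id, [])

    def access(self, ds_id, operation, *args):
        """Run operation(nodes, *args) for the structure named ds_id."""
        if ds_id != self.current_id:
            # Stand-in for the exception raised when the identifier on the
            # instruction does not match the structure held in hardware.
            if self.current_id is not None:
                self.save_context()
                self.context_switches += 1
            self.restore_context(ds_id)
        return operation(self.nodes, *args)
```

Alternating accesses between two identifiers, for example `hw.access("a", heapq.heappush, 1)` followed by `hw.access("b", heapq.heappush, 2)`, forces a context switch on every access after the first, which is the thrashing-like worst case noted above.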

3.3  Experimental  Infrastructure 

I implemented HWDSs in the Simics/GEMS simulator [83], a functionally correct, cycle-accurate
full system simulator for an out-of-order architecture (based on the Alpha architecture)
that executes the SPARC v9 instruction set. The architectural parameters I
chose  are  representative  of  an  embedded  system:  75  MHz  CPU,  80  cycle  memory  latency, 
and a 1-issue 5-stage pipeline. The implementation extends the SPARC instruction set to
use  a  reserved  opcode  for  HWDS  instructions,  which  are  executed  with  a  new  functional 
unit.  This  functional  unit  operates  atomically  and  non-speculatively.  Although  the  HWDS 
can  achieve  single-cycle  latencies  for  priority  queue  operations,  restricting  the  unit  to  be 
atomic  and  non-speculative  increases  the  latency  to  around  12  cycles  for  the  simulated 
architectural  parameters. 

I  modified  RTEMS  [91]  to  provide  OS  support  for  HWDSs.  OS  modifications  include 
HWDS exception handling, overflow data structure implementations, task scheduling algorithm
implementations, a rudimentary HWDS interposition library, and macros to emit
HWDS  instructions.  I  also  modified  the  GCC  compiler  to  support  the  HWDS  instructions, 
although  presently  the  only  way  to  emit  these  instructions  is  with  hand-written  assembly. 


Chapter 4 — Priority Queue HWDS


This chapter shows how the OS can support the priority queue HWDS on behalf of
applications  that  may  require  overflow  handling  for  large  data  sets  or  concurrent  access 
to  the  HWDS  resources.  HWDSs  can  be  effectively  used  in  multitasking  environments 
when  the  hardware  is  managed  properly.  Intuitive  solutions  for  sharing  and  overflow  do 
not  achieve  adequate  performance;  in  the  presence  of  overflow,  simply  using  a  well-known 
and  efficient  overflow  data  structure  leads  to  worse  performance  than  using  a  software-only 
data  structure  implementation. 

4.1  Priority  Queue:  an  Example  HWDS 

A priority queue is a data structure that contains key-value pairs where the key is a priority
upon  which  the  structure  is  sorted.  Usual  operations  on  a  priority  queue  are: 

•  peek  [first,  top]:  returns  the  highest  priority  node 

•  enqueue  [insert,  push]:  adds  a  new  node 

•  dequeue  [delete-min,  pop]:  removes  and  returns  the  highest  priority  node 

•  change-key  [decrease-key]:  modifies  a  node’s  key  (priority) 

•  extract  [delete]:  removes  a  given  node  regardless  of  priority 

•  merge  [meld]:  combines  two  priority  queues  into  one 

A  priority  queue  is  stable  if  nodes  with  the  same  priority  are  dequeued  in  first-in,  first-out 
(FIFO)  order.  The  importance  of  priority  queues  to  application  performance  can  be  seen 
in  the  examples  of  Section  1.1. 
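
To make the stability property concrete, here is a minimal software sketch of a stable priority queue built on Python's `heapq` module. It is not the implementation used in this thesis; the class name and the convention that a smaller number means a higher priority are assumptions made for illustration. A monotonically increasing sequence number breaks ties so that equal priorities leave in FIFO order.

```python
import heapq
import itertools


class StablePriorityQueue:
    """Min-priority queue that dequeues equal priorities in FIFO order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # arrival order, used to break ties

    def enqueue(self, priority, value):
        # The sequence number is unique, so values are never compared.
        heapq.heappush(self._heap, (priority, next(self._seq), value))

    def peek(self):
        priority, _, value = self._heap[0]
        return priority, value

    def dequeue(self):
        priority, _, value = heapq.heappop(self._heap)
        return priority, value

    def __len__(self):
        return len(self._heap)
```

Without the sequence number, a binary heap gives no ordering guarantee among nodes that tie in priority.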

4.1.1 Software priority queues


As  many  software  priority  queue  implementations  exist  as  sorting  algorithms:  Any  sorting 
algorithm  can  implement  a  priority  queue  [123].  For  example,  insert-sort  implements  a 
priority queue with a linked list, heap-sort implements a priority queue with a heap, and
tree-sort implements a priority queue with a BST. A traditional priority queue implementation
uses a heap; an implicit heap, which stores a binary heap as an array, is a common
implementation. Variants of the heap include the binary heap, implicit heap, leftist tree, binomial
queue (binomial heap), pagoda, skew heap, Fibonacci heap [49], pairing heap [48],
Brodal  queue  [22],  and  soft  heap  [28].  BSTs  can  implement  priority  queues  by  keeping 
track  of  the  extreme  (min  and  max)  values  in  the  tree.  Common  BST  implementations  of 
a  priority  queue  use  a  red-black  tree  or  a  splay  tree.  An  advantage  of  a  BST  over  a  heap  is 
that  the  BST  can  more  readily  handle  duplicate  keys  (tied  priority). 

In  1986  Jones  concluded  “[i]mplicit  heaps  are  among  the  worst  choices  for  queues  smaller 
than  20  nodes-and  consistently  worse  than  other  priority-queue  implementations”  [61].  But 
in  1996  LaMarca  and  Ladner  [74]  stated  this  rebuttal: 

[T]he  low  memory  overhead  of  implicit  heaps  makes  them  an  excellent  choice  as 
a  priority  queue,  somewhat  contradicting  Jones’s  results.  We  observed  that  the 
high  memory  overhead  of  the  pointer-based,  self-balancing  queues  translated 
into  poor  memory  system  and  overall  performance. 

And in 2010, Hendriks claimed “[f]or current image analysis programs, the best implementation
of that priority queue is the implicit heap. It has the smallest possible memory
usage  and  is  faster  than  all  other  implementations  tested. . .  [except]  for  very  large  queue 
sizes  [82].”  These  conclusions  indicate  that  implicit  heaps  are  appealing  for  at  least  some 
applications.  This  thesis  uses  implicit  heaps  as  a  software  priority  queue  because  they  are 
simple  and  work  well  in  common  cases. 
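
For readers unfamiliar with the term, an implicit heap can be sketched as follows. This is a generic textbook-style min-heap written for illustration, not the RTEMS code used in the experiments; the class and method names are assumptions. The tree is never materialized: the children of the node at index i live at indices 2i + 1 and 2i + 2 of a plain array, which is the source of the low memory overhead noted by LaMarca and Ladner.

```python
class ImplicitHeap:
    """Binary min-heap stored in an array, with no child pointers."""

    def __init__(self):
        self.a = []

    def enqueue(self, key):
        a = self.a
        a.append(key)                  # insert at the last leaf
        i = len(a) - 1
        while i > 0:                   # sift up toward the root
            parent = (i - 1) // 2
            if a[parent] <= a[i]:
                break
            a[parent], a[i] = a[i], a[parent]
            i = parent

    def dequeue(self):
        a = self.a
        top = a[0]
        last = a.pop()
        if a:
            a[0] = last                # move the last leaf to the root
            i, n = 0, len(a)
            while True:                # sift down to restore heap order
                left, right = 2 * i + 1, 2 * i + 2
                smallest = i
                if left < n and a[left] < a[smallest]:
                    smallest = left
                if right < n and a[right] < a[smallest]:
                    smallest = right
                if smallest == i:
                    break
                a[i], a[smallest] = a[smallest], a[i]
                i = smallest
        return top
```

Both operations walk at most one root-to-leaf path, so each takes logarithmic time in the number of nodes.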

4.1.2 Hardware priority queues


Hardware  priority  queues  motivate  the  HWDS  approach:  enqueue  and  dequeue  happen 
in  constant  time,  whereas  the  fastest  software  implementations  take  logarithmic  time  for 
at  least  one  of  the  two  operations.  An  example  hardware  priority  queue,  the  shift  register 
priority  queue,  is  shown  in  Figure  4-1.  The  shift  register  priority  queue  is  an  array  of 
priority  and  data  payload  tuples  that  are  sorted  by  priority  value.  A  shift  register  block 
encapsulates  each  tuple,  and  each  block  connects  to  its  two  neighbors.  Global  lines  connect 
all  the  blocks  to  the  input  and  control.  Global  broadcast  lines  limit  the  scalability  of  the 
shift  register  priority  queue,  but  each  block  makes  a  decision  locally  so  that  sorting  happens 
in parallel. enqueue broadcasts a new tuple to all blocks. Each block sends its current
tuple  to  the  left  and  compares  its  current  priority  value,  new  priority,  and  priority  from  the 
right.  If  the  new  priority  is  less  than  the  current  priority,  then  the  block  keeps  its  current 
data.  If  the  new  priority  is  between  the  current  priority  and  the  priority  from  the  right, 
then  the  block  latches  the  tuple.  Otherwise,  the  block  latches  the  right  neighbor’s  tuple. 
dequeue  is  simple,  with  each  block  sending  its  tuple  to  the  right  and  latching  from  the 
left.  Other  hardware  priority  queue  implementations  eliminate  the  global  lines — see  the 
discussion  in  Section  2.1. 
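
The per-block decision rule can be simulated in software. The sketch below is an approximation under several assumptions that are not stated in the text: the head of the queue is the rightmost block, a larger number means a higher priority, empty blocks hold a priority of negative infinity, and a tie with the current priority counts as "keep" so that ties leave in FIFO order. The names `ShiftRegisterPQ` and `EMPTY` are hypothetical. Every block computes its next state from the old state only, which mimics the parallel, local decisions of the hardware.

```python
import math

EMPTY = (-math.inf, None)  # (priority, payload) of an unused block


class ShiftRegisterPQ:
    """Software simulation of a shift register priority queue."""

    def __init__(self, size):
        self.blocks = [EMPTY] * size   # the last index is the head

    def enqueue(self, priority, payload):
        old = self.blocks
        if old[0] != EMPTY:
            raise OverflowError("hardware queue is full")
        new = (priority, payload)      # tuple broadcast to every block
        nxt = []
        for i, cur in enumerate(old):
            # Tuple sent leftward by the right neighbor, if there is one.
            right = old[i + 1] if i + 1 < len(old) else (math.inf, None)
            if priority <= cur[0]:
                nxt.append(cur)        # keep the current tuple
            elif priority <= right[0]:
                nxt.append(new)        # latch the broadcast tuple
            else:
                nxt.append(right)      # latch the right neighbor's tuple
        self.blocks = nxt

    def dequeue(self):
        head = self.blocks[-1]
        if head == EMPTY:
            raise IndexError("dequeue from an empty queue")
        # Every block sends its tuple right and latches from the left.
        self.blocks = [EMPTY] + self.blocks[:-1]
        return head
```

In hardware all blocks evaluate the rule in the same cycle; the loop here only serializes those independent decisions.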

extract  can  be  implemented  in  the  shift-register  priority  queue  by  broadcasting  both 
the  target  payload  and  priority  with  a  new  control  signal,  and  by  adding  comparators  to 
check  the  target  payload  against  the  stored  payload.  The  target  node  shifts  in  its  left 
neighbor.  By  comparing  the  priority  value,  the  lower  priority  nodes  will  know  to  shift  their 
values  to  the  right  and  latch  values  from  the  left.  Two  problems  present  themselves:  the 
high  cost  of  comparators  and  insufficient  knowledge  at  nodes  that  have  the  same  priority 
value  as  the  target.  A  solution  to  the  former  is  to  replace  the  payload  with  a  tag,  which 
can  be  sized  according  to  the  length  of  the  hardware  priority  queue  instead  of  the  size  of  a 
pointer. This solution increases latency since peek needs to translate a tag to a pointer and
vice versa for extract; a CAM can implement tag translation efficiently. Using tags does
provide an advantage by reducing the storage and comparison cost for payloads. Sorting
nodes that tie in priority by payload (tag) solves the latter, and can be done in parallel
with the priority comparisons, so although sorting ties by payload adds work to enqueue,
it does not affect latency. However, sorting by payload dictates policy to the priority queue
mechanism, which is not in the spirit of this thesis. (As is, the hardware priority queue
implements FIFO on ties, which dictates a policy that supports a stable priority queue.)
For now, extract is modeled with the same latency as enqueue.

Figure 4-1: A priority queue implemented in hardware.

A  systolic  priority  queue  [77]  might  provide  more  flexible  policies  by  instructing  the 
nodes  lower  than  the  target  to  shift  explicitly,  and  tag  lookup  might  be  pipelined  or  proceed 
in  parallel  with  the  first  systolic  block.  Future  work  can  evaluate  implementations  of 
extract  for  different  hardware  priority  queue  structures. 

4.2 Handling Overflow with a Priority Queue HWDS


The  hardware  mechanism  for  fill  is  simply  enqueue,  but  spill  requires  an  operation 
that  can  return  a  value  from  an  arbitrary  position  within  the  HWDS — in  particular,  the 
ability  to  extract  the  last  (lowest  priority)  node  in  the  queue. 

An  intuitive  solution  for  overflow  handling  would  use  a  binary  heap  as  an  overflow  data 
structure — Chandra  and  Sinnen  [27]  use  one.  But  blindly  enqueuing  sorted  data  into  a 
binary  heap  is  wasteful.  (Indeed,  inserting  nodes  sorted  low-to-high  maximizes  the  work 
done  in  a  min-heap  that  inserts  nodes  at  a  leaf  and  heapifies  up.)  By  leveraging  the 
knowledge  that  the  data  are  sorted,  overflow  handling  can  make  more  intelligent  decisions. 

Consider  instead  a  sorted  linked  list  implementation  of  the  overflow  data  structure  that 
merge-sorts  overflow  nodes.  Suppose  the  number  of  overflow  nodes  is  k  and  the  size  of  the 
overflow data structure is n. The cost of overflow then is approximately $k \lg n$ for a binary
heap and $k + n$ for a linked list, so when $k > \frac{n}{\lg(n) - 1}$ the linked list will outperform the
binary  heap.  With  an  exception-based  approach,  the  amount  of  work  done  during  overflow 
(k)  should  be  tuned  to  amortize  the  cost  of  the  exception  handler  while  minimizing  the 
future  costs  of  exceptions.  With  a  HWDS  of  128  nodes  and  k  =  64  so  that  half  the  nodes 
are  removed  during  overflow,  the  linked  list  approach  should  outperform  the  binary  heap 
for  priority  queues  less  than  about  512  nodes.  In  practice  the  operation  costs  for  the  two 
differ  enough  that  the  linked  list  approach  is  superior  to  the  binary  heap  for  even  larger 
sizes,  but  eventually  the  linear  scaling  factor  of  the  linked  list  does  limit  performance  as 
the  size  of  the  priority  queue  grows. 
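
The break-even point quoted above can be checked numerically. The short Python sketch below encodes the two approximate cost models from this section, k lg n for the binary heap and k + n for the linked list, and searches for the largest overflow size at which the list is no more expensive. The function names are hypothetical, and the models ignore the constant factors that, as noted, favor the linked list in practice.

```python
import math


def heap_cost(k, n):
    """Approximate cost of inserting k nodes into a binary heap of size n."""
    return k * math.log2(n)


def list_cost(k, n):
    """Approximate cost of merging k sorted nodes into a sorted list of size n."""
    return k + n


def break_even(k, n_max=1 << 16):
    """Largest n up to n_max for which the list costs no more than the heap."""
    candidates = [n for n in range(2, n_max + 1)
                  if list_cost(k, n) <= heap_cost(k, n)]
    return max(candidates) if candidates else None
```

With k = 64 the two models meet at n = 512, where both evaluate to 576, which agrees with the estimate of about 512 nodes given in the text.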

I  implemented  both  binary  heap  and  linked  list  overflow  data  structures.  The  binary 
heap  implementation  is  a  split  HWDS:  an  overflow  data  structure  that  does  not  take 
advantage  of  the  HWDS.  The  linked  list  implementation  is  a  united  HWDS:  an  overflow 
data  structure  that  leverages  structural  locality  and  the  HWDS  capabilities. 
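
A minimal sketch of the merge step used by a united HWDS follows. It assumes that the spilled batch arrives already sorted (which holds because the nodes come out of a sorted HWDS) and that the overflow structure is sorted in ascending key order. A Python list stands in for the linked list, and the function name `merge_spilled` is hypothetical. The merge walks from the tail toward the head and stops as soon as the batch is exhausted.

```python
def merge_spilled(overflow, batch):
    """Merge the sorted list `batch` into the sorted list `overflow`.

    Works backward from the tail and stops once every batch node is
    placed, so the untouched prefix of `overflow` is never compared.
    """
    merged = [None] * (len(overflow) + len(batch))
    i = len(overflow) - 1   # next overflow node to consider
    j = len(batch) - 1      # next spilled node to place
    out = len(merged) - 1   # next free slot, filled from the tail
    while j >= 0:
        if i >= 0 and overflow[i] > batch[j]:
            merged[out] = overflow[i]
            i -= 1
        else:
            merged[out] = batch[j]
            j -= 1
        out -= 1
    # In a real linked list this prefix simply stays in place; the copy
    # below exists only because this sketch builds a new Python list.
    merged[:i + 1] = overflow[:i + 1]
    return merged
```

The loop performs at most k + n comparisons for k spilled nodes and n overflow nodes, and far fewer when the spilled nodes belong near the tail of the list.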

For a priority queue, the overflow handling needs to be augmented slightly to ensure
that  ordering  violations  do  not  exist  between  high-priority  nodes  in  the  overflow  data 
structure  and  lower-priority  nodes  in  the  HWDS.  Hardware  modifications  are  necessary 
to  mark  the  lowest  priority  node  remaining  in  the  hardware  priority  queue  after  spilling. 
Hardware  will  also  mark  nodes  when  they  are  enqueued  with  a  lower  priority  than  a  marked 
node.  In  a  shift-register  priority  queue,  this  marking  requires  a  node  to  consult  with  its 
right  neighbor  when  latching  a  new  entry.  When  the  head  of  the  priority  queue  is  marked, 
control  logic  triggers  an  underflow  exception.  The  underflow  handler  fills  the  HWDS  and 
clears  the  mark  on  nodes  with  higher  priority  than  the  lowest  priority  node  remaining  in 
the  spill  region. 

4.3  Experiments 

Priority queues are a critical data structure in applications and systems software; some
uses include planning, image processing, simulations, timer management, and task scheduling.
This section describes two application domains, discrete event simulation and planning,
and the experiments conducted to validate and evaluate the contributions of this
thesis for handling overflow and sharing for priority queue HWDSs.

I  implemented  software-only  priority  queues  and  the  priority  queue  HWDS  using  the 
experimental infrastructure described in Section 3.3. For the software-only priority queue
implementations, I implemented a heap (implicit) and a splay tree. I also implemented
these priority queues as overflow data structures for split HWDSs, in addition to the linked
list united HWDS that is described in Section 4.2.

4.3.1 Discrete event simulation


Discrete  event  simulations  can  spend  up  to  40%  of  execution  time  managing  the  pending 
event  set,  which  is  implemented  efficiently  as  a  priority  queue  [105].  Synthetic  benchmarks 
that  model  the  pending  event  set  are  used  to  evaluate  priority  queue  implementations 
[61, 105].

One model of the pending event set, the classic hold model [61], is useful for benchmarking
priority queue performance with a HWDS. A benchmark in the classic hold model
executes  in  two  phases:  the  first  phase  slowly  builds  a  priority  queue  to  a  predetermined 
maximum  size,  and  the  second  phase  executes  a  series  of  hold  operations — a  dequeue  of 
the  highest  priority  node,  incrementing  the  priority  of  the  dequeued  node,  and  an  enqueue 
of  the  node.  The  classic  hold  model  is  appropriate  for  evaluating  HWDSs  because  the 
maximum  size  of  the  priority  queue,  which  is  a  critical  performance  parameter,  remains 
fixed throughout the second phase of the benchmark. The variables that affect performance
in the hold model of a software-implemented priority queue are its implementation,
size,  shape  (balance),  distribution  of  priorities,  and  the  distribution  of  priority  increment 
values.  A  hardware  priority  queue  must  consider  the  maximum  capacity  of  the  hardware 
and  the  costs  for  overflow. 
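
The two phases of the classic hold model can be sketched as a small driver. This is a simplified illustration, not the microbenchmark used in the experiments: it uses Python's `heapq` as the priority queue, fills the queue with plain inserts in phase one, and takes the increment distribution as a caller-supplied function. The function name `hold_model` and its parameters are assumptions.

```python
import heapq
import random


def hold_model(max_size, holds, increment, seed=0):
    """Run a simplified classic hold model and return the final queue.

    `increment` is a function that takes a random.Random instance and
    returns the amount added to a dequeued priority.
    """
    rng = random.Random(seed)
    queue = []
    # Phase one: build the queue up to its maximum size.
    for _ in range(max_size):
        heapq.heappush(queue, rng.random())
    # Phase two: each hold dequeues the highest-priority node,
    # increments its priority, and enqueues it again.
    for _ in range(holds):
        priority = heapq.heappop(queue)
        heapq.heappush(queue, priority + increment(rng))
    return queue
```

For example, `hold_model(64, 1024, lambda rng: rng.expovariate(1.0))` runs 1024 holds with exponentially distributed increments; the queue size stays at 64 throughout phase two, which is the property that makes the model convenient for studying a fixed HWDS capacity.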

In  order  to  evaluate  the  efficacy  of  overflow  handling  and  sharing,  I  implemented  a 
microbenchmark  based  on  the  classic  hold  model  and  conducted  experiments  to  evaluate 
the  performance  of  the  overflow  handling  and  sharing  support  described  in  Sections  3.1,  3.2, 
and  4.2.  I  obtain  cycle-accurate  measurements  of  execution  time  that  permit  calculating 
a  precise  average  execution  time  for  each  insert  and  hold  operation  during  phase  one, 
and  for  each  hold  operation  in  phase  two;  smaller  numbers  are  better  for  the  hold  model 
benchmarks. 

Figure 4-2: Overflow data structure implementation matters. The average hold time of a
priority  queue  HWDS  with  infinite  capacity,  a  binary  heap  split  HWDS,  and  a  linked  list 
united  HWDS  compared  to  software  priority  queues.  Mean  hold  time  is  averaged  across 
1024  hold  operations;  data  structure  size  is  in  number  of  nodes. 


Overflow  handling  for  large  priority  queues 

The first set of hold model experiments establishes the need for intelligent management of
overflow  data.  These  experiments  build  up  the  priority  queue  to  a  maximum  size  that 
varies  between  64  and  1024  nodes  by  powers  of  2,  and  execute  n  hold  operations  (where  n 
is  1024  or  16384). 

Figure 4-2a shows the obvious benefit when a single application uses an infinite-size
HWDS: hardware is faster than software, an unsurprising result. (Note that the average
cost is around 100 cycles for infinite hardware because the hold time includes 3 HWDS
operations and one arithmetic operation, as well as memory operations to fetch the priority
increment amount and benchmark code.) Figure 4-2b is more interesting: it shows how
overflow handling using the intuitive approach of a heap as an overflow data structure
performs poorly.

If the knowledge that the HWDS contains sorted data is leveraged, a 128-node HWDS
outperforms software-only solutions even in the presence of overflow, as shown in Figure 4-2c.
If one considered only the intuitive approach, opportunity for improvement from
intelligent management of overflow data would be missed. These results indicate that the
OS support for overflow handling, in particular the use of a united HWDS, is a useful
contribution for improving the performance of at least some kinds of applications that use
priority queues.

Figure 4-3: Performance of software and hardware priority queues averaged across 4,096
hold operations with the same ratio of HWDS capacity to data structure size for 128- and
1024-node HWDSs. Panels: (a) 128-node HWDS; (b) 1024-node HWDS.

Figure  4-3  shows  how  increasing  the  capacity  of  the  HWDS  affects  performance,  and 
how  the  performance  trends  are  similar  for  a  fixed  ratio  of  HWDS  capacity  to  the  number 
of  nodes  in  the  data  structure.  Note  that  for  this  benchmark  the  merge-sorted  linked  list 
united  HWDS  outperforms  software  when  the  data  structure  size  is  less  than  16  times  the 
HWDS  capacity  for  both  the  128-  and  1024-node  priority  queue  HWDSs.  These  results 
demonstrate  that  larger  data  structures  can  be  handled  by  proportionally  larger  HWDSs 
using  the  same  policies  and  OS  support  as  the  smaller  HWDSs. 

Figure 4-4: Performance of software and hardware priority queues averaged across 16384
hold operations. Panels: (a) exponentially distributed priority increment; (b) biased
priority increment (toward FIFO); (c) bimodal priority increment.

The last experiment with a single task accessing an unshared HWDS evaluates the
effectiveness of the united HWDS under an increased number of hold operations (16,384)
and a varying probability distribution of the priority increment, which is an important
parameter for determining performance of a priority queue implementation. Figure 4-4
shows the results for this experiment. With respect to the results shown in Figure 4-2b,
the benefits of the united HWDS persist, or even improve, with more work (hold
operations). In terms of the priority increment distribution, the united HWDS does well with
an exponential (negative log) distribution and one that is biased toward FIFO behavior. The
good performance on the biased distribution may seem surprising, since the biased values
ought to cause overflow regularly, but the implementation of the overflow data structure
plays a part. The merge sort iterates from the end of the overflow linked list toward the
start, and the overflow nodes presumably will be toward the rear of the overflow list because
of the bias, so the overflow handler does not need to traverse as much of the data structure.
The HWDS outperforms the binary heap with all three distributions, and underperforms
the splay tree only with the bimodal distribution.
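
The effect of scanning the overflow list from the rear can be illustrated with a small
Python sketch. It is a simplified model under stated assumptions, not the overflow handler
itself: the overflow list is a plain Python list rather than a linked list, and the
function name merge_from_rear and the visited counter are hypothetical.

```python
def merge_from_rear(overflow, spilled):
    """Merge sorted `spilled` keys into sorted `overflow`, scanning from the rear.

    Returns (merged_list, nodes_visited), where nodes_visited counts the
    overflow nodes the scan stepped over before finding each insertion point.
    """
    merged = list(overflow)
    visited = 0
    pos = len(merged)              # cursor starts just past the last node
    for key in reversed(spilled):  # handle the largest spilled key first
        while pos > 0 and merged[pos - 1] > key:
            pos -= 1
            visited += 1
        # A linked list would splice here in constant time; list.insert
        # shifts elements and is used only to keep the sketch short.
        merged.insert(pos, key)
    return merged, visited


# FIFO-biased case: spilled keys belong at the rear, so nothing is traversed.
print(merge_from_rear([1, 2, 3, 4], [5, 6]))
# Unfavorable case: the spilled key belongs at the front of the list.
print(merge_from_rear([5, 6, 7], [1]))
```

When the spilled keys are larger than most of the overflow list, as with a FIFO-biased
increment, the cursor barely moves, which is consistent with the explanation given for
the biased distribution.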


HWDS  assignment  for  multiple  priority  queues 

I created two kinds of multi-tasking pending event set benchmarks. The first kind uses
tasks that each access a private priority queue of the same fixed maximum size. The second
kind also uses tasks that access their own priority queue, but the maximum size varies: in
particular, each task has a maximum size exactly half that of the next largest, with a
smallest maximum size of 16. Varying the maximum size changes which data structures
will benefit most from using the HWDS. The task scheduler is a preemptive time-slicing
round-robin scheduler that allocates a 10 millisecond time slice to each task in each round.

Figure 4-5: Four tasks sharing a hardware priority queue with 1024 hold operations.
Panels: (a) same size priority queues, exponential distribution; (b) different size
priority queues, exponential distribution; (c) different size priority queues, priority
increment biased toward FIFO.

Figure  4-5  shows  the  effect  of  sharing  HWDS  resources  on  both  kinds  of  multi-tasking 
benchmarks  with  an  assignment  algorithm  that  permits  any  data  structure  to  utilize  the  full 
capacity  of  the  HWDS.  Figures  4-5a  and  4-5b  are  the  first  and  second  kind  of  benchmark 
described  in  the  previous  paragraph.  Figure  4-5c  is  the  second  kind  of  benchmark,  but 
with  a  priority  increment  distribution  that  is  biased  toward  FIFO  queue  access.  These 
results  show  that  sharing  imposes  a  cost  even  for  an  infinite-capacity  HWDS,  because  the 
context  switch  must  save  and  restore  data  in  the  HWDS. 

In Figure 4-5a, a large spike in performance is seen near the 1024-node priority queue.
This spike is due to the increasing cost of context switching, which causes more context
switches to occur because the workload is not finishing as quickly. The performance of the
HWDS in Figure 4-5b at points 2048 and 4096 is due to the smaller sizes included: the
2048 point includes a 256, 512, and 1024 queue in addition to the 2048, and
the smaller queues perform better with a HWDS than with software. The performance of
the HWDS when the priority increment is biased shows that the size-limited
HWDS actually outperforms the infinite-capacity HWDS. The performance benefit is due
to the lesser cost of context switching a size-limited HWDS, which motivates experiments
in Section 5.3 that limit the permissible HWDS size that a data structure may use.

Figure 4-6: Multitask sharing of same-sized priority queues with 4096 hold operations and
three sizes of HWDS.

Figure  4-6  shows  how  increasing  the  number  of  hold  operations  affects  performance  for 
the  first  kind  of  benchmark;  these  results  also  show  how  increasing  the  size  of  the  HWDS 
shifts  the  performance  curve.  Comparing  Figure  4-6  with  Figure  4-5a,  as  the  number  of 
operations  increases,  the  performance  of  the  HWDS  improves  (at  least  for  the  observed 
parameter range). The infinite-capacity HWDS does worse than the
1024-node  HWDS  for  the  larger  queue  sizes  because  the  cost  to  context  switch  all  of  the 
nodes  from  the  infinite-capacity  HWDS  is  larger  than  the  cost  to  context  switch  the  smaller 
HWDS.  This  performance  loss  due  to  context  switching  justifies  HWDS  assignment  that 
limits  the  usable  size  of  a  HWDS  in  order  to  constrain  the  context  switch  cost,  which  I 
explore  in  Section  6.3. 

Figures 4-7a and 4-7b show that the biased and bimodal distributions do affect the
HWDS overflow performance with sharing. Even with the cost of context switching, the
HWDS still performs better than a binary heap at the given data structure sizes, although
the splay tree does better in general and especially for the bimodal distribution.

Figure 4-7: Multitask sharing of same-sized priority queues with 4096 hold operations and
varying priority increment distributions. Panels: (a) biased priority increment;
(b) bimodal priority increment.

The  experimental  results  using  multiple  tasks  that  share  a  HWDS  indicate  that  even 
simple  HWDS  assignment  can  improve  performance.  The  results  also  motivate  further 
investigation  into  smarter  HWDS  assignment,  which  is  described  in  the  subsequent  chapters 
of  this  thesis. 


4.3.2  Planning  algorithms 

An important algorithm that makes heavy use of a priority queue is Dijkstra’s shortest-path
algorithm, which is used for routing in network devices, navigation in GPS devices, and as
a basis for the A* family of path-planning algorithms. Dijkstra’s algorithm benefits from
change-key, which makes handling overflow more challenging. Chandra and Sinnen [27]
show that, if change-key is restricted to increasing priority, then inserting a new node with
the updated priority value and allowing the old node to be stale emulates change-key in
a hardware priority queue. This solution, however, increases pressure on the hardware
priority queue size by adding new nodes when updating a node. Instead, I implemented
change-key as a meta-operation that combines extract followed by enqueue with the
new priority; a single macro implements this meta-operation so that the user is unaware
of implementation details. Software can implement change-key directly since the
meta-operation may be less efficient than a direct implementation for some data structures, for
example the binary heap.
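
A minimal Python sketch of the meta-operation follows. ToyPriorityQueue is a hypothetical
stand-in for the hardware priority queue (a dictionary with a linear-time minimum), and
change_key plays the role of the macro; neither name comes from the thesis code.

```python
class ToyPriorityQueue:
    """Hypothetical stand-in for a priority queue with enqueue/extract/dequeue."""

    def __init__(self):
        self._priority_of = {}  # node id -> priority

    def enqueue(self, node, priority):
        self._priority_of[node] = priority

    def extract(self, node):
        """Remove a specific node and return its old priority."""
        return self._priority_of.pop(node)

    def dequeue(self):
        """Remove and return the (node, priority) pair with the smallest priority."""
        node = min(self._priority_of, key=self._priority_of.get)
        return node, self._priority_of.pop(node)

    def __len__(self):
        return len(self._priority_of)


def change_key(pq, node, new_priority):
    """Meta-operation: extract the node, then enqueue it with the new priority."""
    pq.extract(node)
    pq.enqueue(node, new_priority)


pq = ToyPriorityQueue()
pq.enqueue("a", 5)
pq.enqueue("b", 3)
change_key(pq, "a", 1)
print(len(pq), pq.dequeue())
```

The queue holds the same number of nodes before and after change_key, in contrast to the
stale-node emulation, which adds a node on every update.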

To  evaluate  the  cost  of  overflow  handling,  I  use  a  version  of  Dijkstra’s  algorithm  that 
is  executed  on  real-world  maps  taken  from  the  9th  DIMACS  shortest  path  implementation 
challenge  benchmarks  [26].  For  the  software  data  structure  implementation,  the  SmartQ 
implementation  provided  with  the  challenge  benchmarks  is  used.  I  compare  SmartQ  with 
a  modified  benchmark  that  uses  a  hardware  priority  queue. 

The  DIMACS  GPS  benchmarks  evaluate  both  the  potential  limit  of  improvement  for 
HWDSs  and  the  performance  that  is  obtained  using  the  overflow  support  described  in 
Section  4.2  when  the  capacity  of  the  hardware  priority  queue  is  less  than  the  application’s 
data  needs. 

Table 4-1: Priority queue behavior in selected DIMACS GPS benchmarks

Input                  Max Size   Operations   Time
New York City (NY)     925        528693       28.5%
San Francisco (BAY)    886        642540       27.1%
Colorado (COL)         945        871332       30.1%

To find the potential limit of improvement, I instrumented the benchmarks with
performance counters to measure the maximum size of the priority queue, the number of
priority queue operations (enqueue and dequeue) that execute, and the percent of time
that each benchmark spends on priority queue operations. The benchmarks are executed
using timing mechanisms that are provided with the challenge code. These timers query
the host system for the user time of the process running the application. The timing elides
all startup and shutdown costs. To time individual operations, I added timer calls before
and after each priority queue operation and ran the application both unmodified and with
the timer calls. The difference in total time taken between the two runs is the overhead
for making the extra timer calls, half of which is deducted from the sum of the time taken
for priority queue operations (because the time accounted toward the priority queue
operations includes half of the timer overhead). Then the ratio of the time taken for priority
queue operations to the total time taken by the unmodified application is a measure for
the amount of time spent by the application in the priority queue.
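
The overhead correction can be written out as a short calculation. The Python sketch below
is illustrative: the function name pq_time_fraction and the sample timings are made up for
the example and are not measurements from the benchmarks.

```python
def pq_time_fraction(total_unmodified, total_instrumented, pq_time_measured):
    """Estimate the fraction of run time spent in priority queue operations.

    total_unmodified   -- run time without the extra timer calls
    total_instrumented -- run time with a timer call before and after each operation
    pq_time_measured   -- sum of the per-operation times reported by the timers
    """
    # The extra time of the instrumented run is the cost of the timer calls.
    overhead = total_instrumented - total_unmodified
    # Half of that overhead falls inside the timed regions, so remove it.
    pq_time = pq_time_measured - overhead / 2
    return pq_time / total_unmodified


# Made-up timings in seconds: 10.0 unmodified, 12.0 instrumented, 4.0 measured.
fraction = pq_time_fraction(10.0, 12.0, 4.0)
print(fraction)
```

With these made-up inputs the timer overhead is 2.0 seconds, the corrected priority queue
time is 3.0 seconds, and the resulting fraction is 3.0 / 10.0.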

I  gathered  performance  counters  for  all  of  the  USA  road  distance  benchmarks  in  the 
challenge:  Table  4-1  summarizes  the  measurements  for  the  challenge  benchmarks  used  in 
the  following  experiment  to  evaluate  overflow  handling  for  real-world  data  sets.  The  full 
set  of  measurements  is  presented  in  Section  6.4. 

The  performance  measurements  show  that  up  to  30%  of  the  execution  time  of  the 
benchmarks  is  spent  executing  priority  queue  operations.  This  value  gives  an  estimate  of 
the  upper  bound  of  performance  improvement  from  HWDSs. 

I also executed modified versions of the smallest three benchmarks with the
Simics/GEMS experimental infrastructure. The duration of the benchmarks is reduced to
issue 5 path queries; this reduction is necessary so the benchmarks terminate in a
reasonable amount of time when executed under cycle-accurate simulation.

Figure 4-8 shows the results from executing these benchmarks, with the performance
calculated as a percent improvement versus the SmartQ implementation. Note that the
maximum size of the priority queue for these three inputs is less than 1024 (but greater
than 800). When the priority queue HWDS is size 1024 there is no overflow and, as
expected, the performance improvement is close (within about 4 to 7 percentage points)
to the total amount of time spent in the priority queue as measured and reported in
Table 4-1. More interesting is the performance when overflow does occur. With a 512-node
hardware priority queue, the performance of two of the benchmarks is still close to that of
the non-overflow case. Even when a 256-node hardware priority queue is used, the BAY and
COL benchmarks still obtain practical performance improvements. The NY benchmark
performs worse than SmartQ with a 256-node hardware priority queue, which might be
attributed to the ratio of priority queue operations to maximum priority queue size.

The  experiments  with  the  classic  hold  model  suggested  that  increasing  the  number  of 
operations  while  maintaining  the  queue  size  leads  to  improved  HWDS  performance,  and 
the  same  appears  to  be  the  case  with  the  GPS  benchmark.  Finding  the  ideal  ratio  would  be 
an  interesting  study.  These  results  demonstrate  that  a  priority  queue  HWDS  can  benefit 
real-world  application  software  because  of  the  OS  support  for  overflow  handling  introduced 
in  this  thesis. 


Figure 4-8: Performance of priority queue HWDS as percent improvement over SmartQ
with modified (shortened to 5 queries) DIMACS GPS benchmarks. The x-axis is the
hardware priority queue size.

Figure 4-9: Comparison of United HWDS with Split HWDS. Execution time of one
iteration of GPS challenge benchmark on Colorado input using the OS support proposed by this
thesis (UnitedHWDS) compared to prior art (SplitHWDS) [27] normalized to software-only
(SmartQ). Larger is better. The x-axis is the hardware priority queue size (128, 256, 512,
1024); the series are SmartQ, SplitHWDS, and UnitedHWDS.

A last experiment with the DIMACS GPS benchmark evaluates how the exception-based
united HWDS approach proposed and implemented in this thesis compares with the
interposition-based split HWDS proposed and implemented by Chandra and Sinnen [27].
I implemented an interposition-based split HWDS that uses a binary heap as the overflow
data structure, and I modified the DIMACS challenge code to use this HWDS and to
ignore updates (change-key) to nodes. This implementation is equivalent to what has
been proposed in the related work. Figure 4-9 shows the execution times, normalized to
SmartQ, for one iteration of the Colorado GPS challenge benchmark using the
SmartQ, the interposition-based split HWDS proposed by others, and the exception-based
united HWDS that this thesis espouses; larger numbers are better. For sizes over 128, the
united HWDS improves performance as shown earlier. With a HWDS size of 128, neither
HWDS approach does as well as SmartQ; indeed, the split HWDS never does better than
software.

4.4  Summary 

This chapter demonstrated the OS support for HWDSs using a well-known HWDS, the
hardware priority queue. An extract operation is proposed for the shift-register
hardware priority queue. A united HWDS is described and evaluated, and its performance
is compelling on both discrete event simulation and GPS navigation benchmarks using
real-world data. HWDS sharing is evaluated with a multitasking benchmark that re-uses
the discrete event simulation benchmark framework. The next chapter introduces the map
HWDS, which demonstrates how the OS support for the priority queue HWDS translates
to another useful data structure.


Chapter 5 - Map HWDS


A  map  is  a  data  structure  that  organizes  data  to  support  efficient  searching.  Searching  is 
a  fundamental  problem  in  computing:  return  the  node  with  a  specific  key  from  a  set  of 
(key,  value)  nodes.  The  specified  key  is  the  argument  to  the  search  [68].  Usual  operations 
on  a  map  are: 

•  insert:  adds  a  new  node 

•  extract:  removes  a  given  node 

•  change-value:  modifies  a  node’s  value 

•  search:  finds  a  node  with  the  given  argument 

A search is exact if the returned node has the same key as the argument and approximate
if it has the closest key. Keys can have arbitrary length and meaning; common keys include
numbers, strings, indices, and hash values. If the search compares key and argument
directly then it is a comparison search; a digital search relies on the binary representation
of the argument to find the key. The skewness of a search is a measure of the asymmetry
of the probability distribution of arguments; text search tends to be strongly skewed, so
skewness is an important parameter to consider when evaluating solutions for searching.
This thesis considers exact comparison search with numerical keys with varying skewness
and maps that use insert, extract, and search; future work may consider other kinds
of search problems and maps that support a change-value operation.

5.1  Software-based  Search 

Common data structures that support efficient searching are the BST, balanced trees,
self-adjusting trees, hash tables, and multiway trees; Knuth [68] describes these in great detail
in his textbook. Balanced trees, such as the AVL and red-black trees, ensure O(log n) search
(and insert and remove) operations. Self-adjusting trees, such as the splay tree, relocate nodes
within the tree so that frequently accessed nodes are located nearer the root to improve
performance for skewed search. Probabilistic search structures, such as the skip list [102], use
randomization for faster creation and maintenance and provide probabilistic algorithmic
performance.

Bell  and  Gupta  [13]  evaluated  numerical  comparison  search  using  BSTs,  AVL  trees, 
and  splay  trees;  I  adopt  their  evaluation  benchmarks  to  evaluate  the  OS  support  for  map 
HWDSs.  Their  findings  indicate  that  AVL  trees  outperform  the  other  trees,  although  the 
gap  closes  when  data  are  skewed.  While  surprising,  their  results  have  also  been  shown  by 
others  for  string  search:  Williams  et  al.  [131]  found  that  BSTs  outperform  treaps,  splay 
trees,  and  red-black  trees;  a  modified  splay  tree  does  improve  over  BSTs. 

5.2  Map  HWDS 

Hardware  can  search  small  sets  of  records  with  numerical  keys  efficiently  with  a  CAM,  but 
it,  like  a  hardware  priority  queue,  does  not  support  overflow  handling  or  sharing  directly.  A 
CAM  also  does  not  support  direct  comparison  searches  except  for  specialized  uses  in  which 
the  values  are  integers  that  fall  within  the  address  range  of  the  CAM;  such  is  the  case 
for the page table-TLB that is described in section 2.1.6. However, the solutions presented
earlier in chapter 3 do translate to CAM-based (and other) map HWDSs.

5.2.1  CAM-based  map  HWDS 

An implementation of a map HWDS can use a CAM and a fast random-access memory
(RAM), such as SPM, that are the same size. To implement insert, the HWDS stores
the key in the CAM at an available location, and stores the value in the same location
in the RAM, marking the location unavailable. (The entire addressable range is marked
available during initialization.) An extract does a search for the key, marks the memory
at the returned location as available, and returns the node. During a search, the HWDS
control logic passes the argument to the CAM to obtain the location, indexes the RAM at
that location to get the value, and returns the node comprising key and value.
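
The three operations can be modeled in a few lines of Python. This is a toy software model
under stated assumptions, not the simulated hardware: the class name CamMap is hypothetical,
keys are assumed unique, and the sequential loop in _match only models the result of a
CAM lookup, which real hardware performs on all entries in parallel.

```python
class CamMap:
    """Toy model of a CAM paired with a same-size RAM."""

    def __init__(self, capacity):
        self.cam = [None] * capacity        # keys, searched by content
        self.ram = [None] * capacity        # values, indexed by location
        self.available = [True] * capacity  # every location starts available

    def _match(self, key):
        # Models the CAM lookup: return the location holding `key`, if any.
        for loc, stored in enumerate(self.cam):
            if not self.available[loc] and stored == key:
                return loc
        return None

    def insert(self, key, value):
        for loc, free in enumerate(self.available):
            if free:
                self.cam[loc], self.ram[loc] = key, value
                self.available[loc] = False
                return True
        return False  # full: a real HWDS would hand off to overflow handling

    def search(self, key):
        loc = self._match(key)
        return None if loc is None else (key, self.ram[loc])

    def extract(self, key):
        loc = self._match(key)
        if loc is None:
            return None
        self.available[loc] = True  # mark the location available again
        return (key, self.ram[loc])


m = CamMap(capacity=2)
m.insert(7, "x")
m.insert(9, "y")
print(m.search(9), m.extract(7), m.search(7))
```

Returning False from insert when no location is available marks the point at which
overflow handling would take over.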

5.2.2  Overflow  handling 

The hardware mechanisms for spill and fill are extract and insert. Unlike for the priority
queue HWDS, I am unaware of any united HWDS for maps. Therefore any efficient software
map data structure can serve as the overflow data structure. I implemented three
such structures: a red-black tree, a splay tree, and a skip list.

5.2.3  Least  recently  used  (LRU)  spilling  and  fill-after-search 

Skewed  search  provides  an  opportunity  for  more  intelligent  overflow  handling.  In  particular, 
a  strongly  skewed  search  will  repeat  some  arguments  more  often,  which  indicates  that 
temporal locality may be exploited. To evaluate whether temporal locality in overflow
handling makes a difference, I re-implemented spill to remove the least recently used
(LRU) item from the map HWDS, and I re-implemented failover during search to execute a
fill if the node is found. LRU-based overflow with fill-after-search attempts to exploit
temporal locality in skewed searches.
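
The interaction of LRU spilling and fill-after-search can be sketched in Python as follows.
The sketch is a simplified model: an OrderedDict stands in for the map HWDS and its recency
tracking, a plain dict stands in for the software overflow structure, keys are assumed
unique, and the class name LruMapWithOverflow is hypothetical.

```python
from collections import OrderedDict


class LruMapWithOverflow:
    def __init__(self, capacity):
        self.capacity = capacity
        self.hw = OrderedDict()  # models the HWDS; oldest entry comes first
        self.overflow = {}       # stand-in for the software overflow structure

    def _spill(self):
        # Move the least recently used node from hardware to overflow.
        key, value = self.hw.popitem(last=False)
        self.overflow[key] = value

    def _fill(self, key, value):
        # Place a node in hardware, spilling first if hardware is full.
        if len(self.hw) >= self.capacity:
            self._spill()
        self.hw[key] = value

    def insert(self, key, value):
        self._fill(key, value)

    def search(self, key):
        if key in self.hw:
            self.hw.move_to_end(key)   # mark as most recently used
            return self.hw[key]
        if key in self.overflow:       # failover to the software structure
            value = self.overflow.pop(key)
            self._fill(key, value)     # fill-after-search
            return value
        return None


m = LruMapWithOverflow(capacity=2)
m.insert(1, "a")
m.insert(2, "b")
m.search(1)        # key 1 becomes most recently used
m.insert(3, "c")   # spills key 2, the least recently used
m.search(2)        # found in overflow, filled back; key 1 is spilled
print(list(m.hw), list(m.overflow))
```

A repeated search argument stays in hardware after its first failover, which is how the
policy turns skew in the arguments into hardware hits.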

5.2.4  Size  checks 

Even with intelligent overflow handling, when the size of a map exceeds the capacity of
the hardware by a sufficient amount the performance of a map HWDS is worse than
just using software. I implemented a simple HWDS assignment algorithm that detects if
the requested size of a data structure exceeds the HWDS capacity and, if so, assigns the
data structure to software-only. The result of the check hooks software function calls that
can go either to a HWDS or to a software implementation. Currently this check is done only
ahead of time with the cooperation of application software; future work can consider a
dynamic change, which would likely demand the use of an interposition-based HWDS or
additional hardware support.
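
The ahead-of-time check amounts to choosing one of two operation tables before the data
structure is used. The Python sketch below illustrates the idea; the function name
assign_implementation and the dictionary-based operation tables are hypothetical and are
not the hooking mechanism used in the implementation.

```python
def assign_implementation(requested_size, hwds_capacity, hwds_ops, software_ops):
    """Pick the operation table to hook, once, before the structure is used."""
    if requested_size > hwds_capacity:
        return software_ops  # too large: assign to software-only
    return hwds_ops


# Hypothetical operation tables; real ones would hold insert/extract/search hooks.
hwds_ops = {"backend": "hwds"}
software_ops = {"backend": "software"}

for size in (64, 128, 256):
    ops = assign_implementation(size, 128, hwds_ops, software_ops)
    print(size, ops["backend"])
```

Because the choice is made once from the requested size, a map that grows past its
declared size later is not reassigned, which corresponds to the limitation noted above.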

5.2.5  Dynamic  eviction 

When  applications  search  and  extract  in  the  overflow  data  structure,  the  performance 
of  a  HWDS  suffers,  especially  with  an  exception-based  HWDS.  Another  opportunity  to 
improve  performance  is  to  detect  these  conditions  and  prevent  them  from  happening  if 
possible.  A  simple,  direct  method  is  to  evict  the  data  structure  from  the  hardware  and  rely 
solely  on  a  software  implementation.  This  method  requires  an  interposition-based  HWDS, 
since  otherwise  every  single  data  structure  operation  would  cause  a  failover  exception.  I 
implemented  a  basic  interposition-based  HWDS  to  study  the  effect  of  dynamic  eviction. 
The HWDS assignment policy using eviction decides to assign a data structure to
software-only when an extract is detected that targets a node in the overflow data structure.
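
The eviction policy can be sketched as a thin interposition layer in Python. The sketch
makes simplifying assumptions that are not taken from the implementation: plain dicts model
both the hardware contents and the software structure, inserts beyond capacity go straight
to the software structure rather than triggering a spill, and the class name EvictingMap is
hypothetical.

```python
class EvictingMap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.hw = {}         # models the map HWDS contents
        self.software = {}   # overflow structure; the only store after eviction
        self.evicted = False

    def _evict(self):
        # Move every node out of hardware and stop using the HWDS.
        self.software.update(self.hw)
        self.hw.clear()
        self.evicted = True

    def insert(self, key, value):
        if not self.evicted and len(self.hw) < self.capacity:
            self.hw[key] = value
        else:
            self.software[key] = value

    def search(self, key):
        if not self.evicted and key in self.hw:
            return self.hw[key]
        return self.software.get(key)

    def extract(self, key):
        if not self.evicted:
            if key in self.hw:
                return self.hw.pop(key)
            if key in self.software:
                self._evict()  # the extract targeted the overflow structure
        return self.software.pop(key, None)


m = EvictingMap(capacity=2)
for key, value in ((1, "a"), (2, "b"), (3, "c")):
    m.insert(key, value)
m.extract(1)   # served by hardware, no eviction
m.extract(3)   # targets overflow, so the map is evicted to software
print(m.evicted, m.search(2))
```

After eviction every operation is routed by the wrapper directly to the software
structure, so no further failover cost is paid.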


5.3  Experiments 

Maps are a critical data structure in applications and systems software; some examples
include language interpreters, key-value stores, virtual memory address mapping,
schedulers, and timers. This section describes a synthetic search benchmark and experiments
conducted to validate and evaluate the OS support for overflow handling and sharing
proposed in this thesis for map HWDSs. I implemented software maps and the map
HWDS (described earlier in Section 5.2.1) in the experimental infrastructure described in
Section 3.3. The software-only map implementations include a red-black tree, splay tree,
and skip list. I also implemented these software maps as overflow data structures for split
HWDS.

To  test  the  map  HWDS,  I  implemented  a  synthetic  benchmark  described  by  Bell  and 
Gupta  [13]  that  has  four  steps: 

1.  Select  unique  integer  keys  at  random  from  a  uniform  distribution. 

2. Insert every key in each tree under test and in the access probability table, a table
containing pairs of key and probability of access that is sorted by probability;
probability values affect skewness of key access and are drawn from a modified Zipf’s
distribution.

3. Issue pairs of extract-insert operations and search operations following an activity
ratio, the ratio of search to extract-insert.

4.  Record  the  time  consumed  during  the  operations  for  performance  measures. 

Key selection, probability selection, and operations are generated offline. Skewness is
controlled by the variable a, which yields the uniform distribution when equal to 0 and
Zipf’s distribution when equal to 1. In experiments using this benchmark, a varies between
0 and 1.420. In general, only the extreme values are interesting, so representative results
are shown for a equal to 0, 1.058 (closest to Zipf’s), and 1.420.
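
The role of a can be illustrated with a small Python sketch. The exact form of the modified
Zipf’s distribution is not given in this section, so the sketch assumes access probabilities
proportional to 1 / rank^a, which reproduces the two stated endpoints (uniform at a = 0,
Zipf’s at a = 1); the function name access_probabilities is hypothetical.

```python
def access_probabilities(num_keys, a):
    """Return access probabilities for ranks 1..num_keys, highest first.

    Assumes probability proportional to 1 / rank**a (an assumption, see above).
    """
    weights = [1.0 / (rank ** a) for rank in range(1, num_keys + 1)]
    total = sum(weights)
    return [w / total for w in weights]


for a in (0.0, 1.058, 1.420):
    probs = access_probabilities(4, a)
    print(a, [round(p, 3) for p in probs])
```

Larger values of a concentrate more of the probability mass on the highest-ranked keys,
which is the temporal locality that skew-aware overflow handling tries to exploit.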

As with the experiments with the classic hold model described in section 4.3.1, the
search benchmark proceeds in two phases of execution. The first phase (step 2 above)
builds the map, and the second phase (step 3) modifies and searches within the map. In
addition to the established search parameters (activity ratio and skewness), the map’s size
(number of unique integer keys) is varied to stress the hardware’s capacity. Measurements
of execution time give the average time for each search, extract, and insert during the
second phase of execution.

These experiments build up the map to a maximum size that varies between 64 and 2048
by powers of 2, and execute n operations (either 1000 or 4000 for the results presented
here) during phase two. An update is counted as one operation, so depending on the
activity ratio, the number of total HWDS instructions varies (from 1 * n to 1.8 * n where
n is the number of operations).
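
The instruction counts follow from simple arithmetic, sketched below in Python. The sketch
assumes that the activity ratio is the fraction of the n operations that are updates and
that each update issues two HWDS instructions (one extract and one insert) while each search
issues one; the function name hwds_instruction_count is hypothetical.

```python
def hwds_instruction_count(n_operations, activity_ratio):
    """Total HWDS instructions for n operations at a given activity ratio."""
    updates = round(n_operations * activity_ratio)  # each is extract + insert
    searches = n_operations - updates               # each is one instruction
    return searches + 2 * updates


for n, ratio in ((1000, 0.0), (1000, 0.5), (1000, 0.8), (4000, 0.8)):
    print(n, ratio, hwds_instruction_count(n, ratio))
```

Under these assumptions the total is n * (1 + activity ratio), which gives the range of
1 * n to 1.8 * n for activity ratios between 0% and 80%.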


Figure 5-1: The improvement of infinite hardware and the performance of software map
implementations. The x-axis is the data structure size. Panels: (a) 0% activity ratio, 1000
operations, a = 0.0; (b) 50% activity ratio, 1500 operations, a = 0.0; (c) 80% activity
ratio, 1800 operations, a = 0.0; (d) 0% activity ratio, 1000 operations, a = 1.420;
(e) 50% activity ratio, 1500 operations, a = 1.420; (f) 80% activity ratio, 1800
operations, a = 1.420.

The first set of search benchmarks demonstrates the benefits of an infinite-size map
HWDS and the relative performance of the three software-only map implementations.
Figure 5-1 shows the benefit when an infinite-size map HWDS is used with and without
skewness (a = 0 and 1.420). Figures 5-1a, 5-1b, and 5-1c show the results for activity ratios
of 0% (no updates), 50%, and 80% respectively with a = 0. The average cost with infinite
hardware is around 50 cycles because an operation involves one HWDS instruction and the
memory accesses to load the code and data for the operation. These charts also show the
relative performance of the software-only map implementations, among which the red-black
tree performs best in most cases, followed by the skip list then splay tree. On the other
extreme, Figures 5-1d, 5-1e, and 5-1f show how with a = 1.420, the performance of the
infinite-size HWDS remains the same, but the performance of the software implementations
changes. These results suggest that the splay tree does better when the map is read-mostly,
and the red-black tree does better under heavy updates. The skip list never outperforms
the trees and is omitted from the remainder of this thesis.

Figure 5-2: Overflow handling using the extract-last policy of priority queues with a
128-node map HWDS (HWMAP). (Results with other parameters are similar to Figure 5-2a.)
The x-axis is the data structure size. Panels: (a) 0% activity ratio, 1000 operations,
a = 0; (b) 0% activity ratio, 1000 operations, a = 1.420.


5.3.1  Overflow  handling  for  large  maps 


The next set of search benchmark experiments establishes the need for intelligent
management of overflow. Figure 5-2 shows how overflow handling using the same policy as a
priority queue HWDS (spilling the node with the largest key) performs poorly as the
map size increases. (Other values of a and activity ratio are similarly bad for the 128-node
HWDS. The overflow data structure is a red-black tree.)

Figure 5-3: Map overflow with LRU and fill-after-search, 1000 search operations. The
x-axis is the data structure size. Panels: (a) 0% activity ratio, 1000 operations, a = 0.0;
(b) 0% activity ratio, 1000 operations, a = 1.058; (c) 0% activity ratio, 1000 operations,
a = 1.420; (d) 50% activity ratio, 1500 operations, a = 0.0; (e) 50% activity ratio, 1500
operations, a = 1.058; (f) 50% activity ratio, 1500 operations, a = 1.420.


5.3.2  LRU  spilling  and  fill-after-search 

Figure 5-3 shows that an LRU-based map HWDS that fills nodes found during a failover
search can handle overflow more effectively as arguments become more skewed and activity
ratio decreases. This result is not surprising, since a low activity ratio means more
searching, for which LRU should be effective, and the more skewness in the search arguments
the more temporal locality is available to exploit.

Figure 5-4: Map overflow, 1024-node HWDS, 4000 search operations. Panels: (a) 0%
activity ratio, 4000 operations, a = 0.0; (b) 0% activity ratio, 4000 operations, a = 1.420;
(c) 80% activity ratio, 7200 operations, a = 1.420.


Figure 5-4 shows how a larger HWDS performs with a larger data structure size.
Performance is similar between the 128- and 1024-node HWDSs at a given ratio of HWDS
capacity to data structure size, except for the read-mostly skewed search, which benefits
greatly from having a larger HWDS because the increased capacity enables the HWDS to
exploit temporal locality better. The map HWDS outperforms software when the ratio of
data structure size to HWDS capacity is less than 1.5:1. Note that these results use 4000
search operations rather than 1000; the number of operations had little effect on search
performance at this scale, although further experimentation is warranted to determine if
the operation count affects performance at larger scales.
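
The LRU spilling and fill-after-search policy evaluated in this section can be illustrated with a small software model. The sketch below is not the simulator used for these experiments; the class name `LruMapModel`, the use of a Python dictionary in place of the red-black tree, and the `failovers` counter are assumptions made only for illustration.

```python
from collections import OrderedDict


class LruMapModel:
    """Toy model of a fixed-capacity map HWDS backed by a software overflow map."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed to be at least 1
        self.hwds = OrderedDict()     # least recently used entry comes first
        self.overflow = {}            # stand-in for the red-black tree
        self.failovers = 0            # searches that fell through to software

    def _spill_lru(self):
        # Move the least recently used entry out of the HWDS.
        key, value = self.hwds.popitem(last=False)
        self.overflow[key] = value

    def insert(self, key, value):
        self.overflow.pop(key, None)
        if key not in self.hwds and len(self.hwds) >= self.capacity:
            self._spill_lru()
        self.hwds[key] = value
        self.hwds.move_to_end(key)    # mark as most recently used

    def search(self, key):
        if key in self.hwds:
            self.hwds.move_to_end(key)    # refresh recency on a hit
            return self.hwds[key]
        if key not in self.overflow:
            return None
        self.failovers += 1
        value = self.overflow.pop(key)    # failover search in software
        self.insert(key, value)           # fill-after-search
        return value
```

With skewed search arguments, repeated searches for the same few keys hit in `hwds` after the first failover, which is the temporal locality the text refers to.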

Except  for  skewed  search-only  workloads,  the  map  HWDS  outperforms  software  only 
when  the  map  contains  fewer  than  50%  more  nodes  than  the  HWDS  capacity.  The  amount 
of  overflow  that  can  be  tolerated  is  much  less  than  with  the  priority  queue  HWDS,  and 
future  work  should  investigate  how  to  increase  the  amount  of  overflow  that  the  map  HWDS 
can  handle.  The  priority  queue  benefits  from  exploiting  structural  locality  in  the  united 
HWDS, and perhaps a similar approach can be developed for the map HWDS.

5.3.3 Eviction


Figure 5-5 shows how a HWDS assignment policy that evicts the map to software upon the
first extract can help to curb performance loss when overflow causes failover. With
eviction, the performance of the interposition-based HWDS closely matches that of the
software-only implementations. These results demonstrate that effective software support
can yield performance that achieves approximately the best of both worlds.


[Figure 5-5 panels, each plotted against data structure size: (a) 50% activity ratio, 1500 operations, a = 0.0. (b) 50% activity ratio, 1500 operations, a = 1.058. (c) 50% activity ratio, 1500 operations, a = 1.420. (d) 80% activity ratio, 1800 operations, a = 0.0. (e) 80% activity ratio, 1800 operations, a = 1.058. (f) 80% activity ratio, 1800 operations, a = 1.420.]

Figure 5-5: Map HWDS overflow handling with HWDS assignment to software upon first EXTRACT.

5.3.4 Sharing for multiple maps


To  evaluate  sharing  map  HWDSs,  I  created  separate  search  benchmarks  and  placed  each 
within its own task with a task-private map. Each search benchmark has identical
parameters except for the maximum map size: each map has a maximum size exactly half
that  of  the  next  largest,  with  a  smallest  maximum  size  of  16.  Varying  the  maximum  size 
changes  which  maps  benefit  from  the  HWDS.  The  task  scheduler  is  preemptive  time-slicing 
round-robin  with  10  millisecond  time  slices. 


[Figure 5-6 panels, each plotted against data structure size: (a) 0% activity ratio, 4000 operations, a = 0.0. (b) 0% activity ratio, 4000 operations, a = 1.058. (c) 0% activity ratio, 4000 operations, a = 1.420. (d) 80% activity ratio, 7200 operations, a = 0.0. (e) 80% activity ratio, 7200 operations, a = 1.058. (f) 80% activity ratio, 7200 operations, a = 1.420.]

Figure 5-6: Multitasking search with overflow and different-sized priority queues.


Figure 5-6 shows the performance of an infinite-size map HWDS, software maps (splay
tree and red-black tree), and a limited-size, 128-node map HWDS on the multitasking
search benchmark. Each power of two over 128 causes another map to overflow; at 256
maximum size, only the one task using a map of that size suffers performance penalties due
to overflow. These results show that HWDSs can achieve substantial performance gains
versus software, and that as search becomes more skewed, the splay tree performance meets
that of the red-black tree. Also, when mixing overflow and non-overflow workloads, the
HWDS can still perform well in some cases, but eventually performs poorly compared with
software-only map implementations.


[Figure 5-7 panels, each plotted against data structure size: (a) 0% activity ratio, 4000 operations, a = 0.0. (b) 0% activity ratio, 4000 operations, a = 1.058. (c) 0% activity ratio, 4000 operations, a = 1.420. (d) 80% activity ratio, 7200 operations, a = 0.0. (e) 80% activity ratio, 7200 operations, a = 1.058. (f) 80% activity ratio, 7200 operations, a = 1.420.]

Figure 5-7: Multitasking search benchmarks with size-based assignment and different-sized
priority queues.

Figure 5-7 shows the multitasking search benchmarks with a HWDS assignment that
uses  size  checks  to  avoid  using  the  HWDS  in  case  the  requested  size  of  the  map  data 
structure  exceeds  the  available  capacity  of  the  HWDS.  The  performance  of  the  HWDS 
with size-checking assignment is better than that of the software-only implementations.
These results show that applications can avoid being wasteful with a HWDS by only using
the hardware resources when they will be beneficial. As with the priority queue HWDS, how
to find the best ratio of map HWDS capacity to data structure size is an open question.

5.4  Summary 

This  chapter  presented  a  map  HWDS  that  uses  the  same  basic  style  of  overflow  handling 
and  sharing  as  the  generic  HWDS  support  explicated  in  Chapter  3.  Enhancements  were 
proposed  and  evaluated,  including  LRU  spilling  and  filling,  dynamic  eviction,  and  size- 
based  HWDS  assignment  to  prevent  expensive  overflow.  Experimental  results  demonstrate 
the  viability  of  map  HWDSs  and  the  proposed  improvements.  Some  of  the  remaining  issues 
that  are  suitable  for  future  work  include  fleshing  out  the  map  HWDS  that  is  sketched  in 
Section  5.2.1,  implementing  and  evaluating  the  usefulness  of  the  change-value  operation, 
and inventing a united HWDS for searching.

Chapter 6 - Shared HWDSs for Hard Real-Time Systems


Note:  Portions  of  this  chapter  were  previously  published  [17]. 

HWDSs  can  reduce  the  latency  and  jitter  of  data  structure  operations,  which  can 
benefit  real-time  systems  by  reducing  WCETs.  The  OS  support  for  overflow  handling  and 
sharing proposed in this thesis permits applications to benefit from HWDSs; this benefit was
demonstrated  in  Chapters  4  and  5.  However,  real-time  applications  have  different  execution 
requirements  than  general-purpose  applications.  This  chapter  explores  those  requirements 
using  a  priority  queue  HWDS,  presents  two  novel  algorithms  for  HWDS  assignment  in  a 
real-time  system,  and  evaluates  these  algorithms  with  synthetic  task  sets  and  benchmarks 
modeled  from  priority  queue  behavior  measured  in  two  applications  that  are  important 
in  real-time  and  embedded  domains:  the  grey-weighted  distance  transform  for  topology 
mapping  and  Dijkstra’s  algorithm  for  GPS  navigation.  Experimental  results  indicate  that 
HWDSs  can  reduce  the  WCET  of  applications  even  when  a  HWDS  is  shared  by  multiple 
data  structures  or  when  data  structure  sizes  exceed  HWDS  size  constraints. 

6.1  Real-time  Considerations  for  HWDSs 

Unlike in general-purpose computing, latency, which affects predictability, trumps
throughput in a real-time system.

6.1.1  Overflow  handling 

For real-time systems, the execution time and rate of overflow and underflow exceptions
are important because those two parameters affect a task's WCET when using a HWDS.
Exception handler execution time depends on the size of the overflow data structure and
the number of nodes spilled (equivalently filled). The rate of exceptions depends on two
factors: the rate of operations and the number of nodes spilled. The size and rate of
operations are application-dependent, but if they are bounded then the exception WCET
and rate depend on the amount of work done, that is, the number of nodes spilled.

Tuning  the  number  of  nodes  spilled  by  a  priority  queue  HWDS  to  be  any  number  k 
less  than  or  equal  to  half  of  the  HWDS  capacity  limits  the  number  of  exceptions  to  at 
most  one  overflow  and  one  underflow  per  k  operations.  In  any  window  of  k  operations 
the  worst  case  is  that  the  entire  HWDS  is  full  of  marked  nodes  and  a  peek  operation 
is  followed  by  an  enqueue.  The  peek  induces  an  underflow  exception  since  the  head  is 
marked.  The  underflow  handler  fills  the  HWDS  with  k  nodes  and  spills  at  least  k  nodes, 
leaving  the  HWDS  in  a  state  with  at  least  k  unmarked  nodes  and  possibly  marked  nodes 
in  the  remainder  of  the  HWDS.  The  HWDS  can  then  satisfy  at  least  k  operations  without 
another  underflow.  The  subsequent  enqueue  may  cause  an  overflow  exception  which  will 
spill  k  nodes.  At  this  point  the  HWDS  can  satisfy  at  least  k  operations  without  another 
overflow.  Tuning  the  handlers  to  spill  half  of  the  HWDS  size  minimizes  the  number  of 
exceptions  taken,  which  is  important  because  each  exception  that  gets  taken  adds  extra 
fixed  processing  overhead  to  invoke  the  handler. 
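
The overflow half of this argument can be checked with a small occupancy model. The sketch below is only an illustration: the function name `count_overflows` is hypothetical, and it assumes that every operation is an enqueue, that the HWDS starts full, and that the handler spills exactly k nodes per exception.

```python
import math


def count_overflows(capacity, k, enqueues):
    """Count overflow exceptions for a burst of enqueues into a full HWDS.

    Assumes each enqueue adds one node and the overflow handler always
    spills exactly k nodes (illustrative assumptions, not measured behavior).
    """
    if not 1 <= k <= capacity // 2:
        raise ValueError("k must satisfy 1 <= k <= capacity / 2")
    occupancy = capacity          # worst case: start with a full HWDS
    overflows = 0
    for _ in range(enqueues):
        if occupancy == capacity:
            overflows += 1        # exception taken, handler spills k nodes
            occupancy -= k
        occupancy += 1
    return overflows
```

Under these assumptions the count equals the ceiling of the number of enqueues divided by k, so a larger k means fewer exceptions, each of which does more work.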

Failover exceptions are frustrating for a WCET analysis of HWDSs. In this work on
real-time systems, failover is prevented through tight control of application software.
(Additional software engineering is the norm in real-time system development, as is tight
hardware-software integration, so the extra control is not an unusual burden to developers.)
If the rate of failover exceptions were bounded, they would fit in a WCET analysis similarly
to the overflow and underflow exceptions.

6.1.2 Sharing


Sharing is handled similarly to the description in Chapter 3, but with two adjustments.
First,  the  HWDS  context  switch  tracks  how  many  nodes  it  saves,  and  refills  that  data 
structure  with  the  same  number  of  nodes.  This  adjustment  ensures  that  the  same  number 
of  nodes  are  present  in  the  HWDS  when  the  context  is  restored,  an  important  consideration 
for  bounding  the  cost  of  HWDS  context  switching.  Second,  a  task  is  only  permitted  to 
use  one  HWDS  context;  that  is,  only  one  of  any  given  task’s  priority  queues  may  use  the 
priority  queue  HWDS.  This  second  adjustment  aligns  the  HWDS  context  switch  with  the 
task  context  switch,  which  is  important  when  analyzing  a  task’s  WCET.  The  worst  case 
cost  of  a  HWDS  context  switch  is  when  the  HWDS  is  full  and  the  handler  is  refilling  from 
a  previously  full  HWDS  so  that  the  handler  spills  and  fills  the  entire  HWDS.  Similar 
to  overflow  handling,  the  cost  of  a  HWDS  context  switch  depends  on  the  overflow  data 
structure  size  and  implementation,  and  the  number  of  nodes  in  the  HWDS. 

6.2  Response  Time  Analysis 

A HWDS affects task response time by decreasing WCET through reduced operation
latency, but exceptions caused by overflow/underflow conditions increase WCET. Sharing the
HWDS among tasks also increases the response time. The following response time analysis
extends a standard response time analysis [11] to include variables that affect WCET when
using a HWDS. This analysis only considers periodic tasks.

6.2.1  Notation 

• $\Gamma$: the set of all tasks

• $T_i$: the $i$'th task

• $p_i$: period of $T_i$

• $e_i$: the WCET of $T_i$

• $c_i$: the maximum context switch latency of $T_i$

Usually $c_i$ is equal for all tasks and is included twice in $e_i$: once for the task
preempted by $T_i$ and once for resuming that task.


6.2.2  Standard  response  time  analysis 

The response time $R_i$ of $T_i$ is the minimum value of $t$ satisfying

$$ t = e_i + \sum_{k=1}^{i-1} \left\lceil \frac{t}{p_k} \right\rceil e_k \tag{6.1} $$

Equation 6.1 considers the WCET of $T_i$ plus the sum of processor time of higher priority
tasks overlapping with the time interval $t$. $R_i$ is found by solving the recurrence

$$ t^{(l+1)} = e_i + \sum_{k=1}^{i-1} \left\lceil \frac{t^{(l)}}{p_k} \right\rceil e_k $$

starting with $t^{(0)} = e_i$. $\Gamma$ is schedulable if $R_i \le p_i$ for all $T_i \in \Gamma$.
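
The recurrence can be solved by straightforward fixed-point iteration. The following is a minimal sketch, assuming tasks are stored in priority order with index 0 as the highest priority (the text indexes tasks from 1) and that each deadline equals the period; the function names are illustrative.

```python
import math


def response_time(i, wcet, period):
    """Iterate Equation 6.1 for task i (index 0 has the highest priority).

    Returns the response time, or None once the iterate exceeds the period,
    in which case the task cannot meet a deadline equal to its period.
    """
    t = wcet[i]
    while t <= period[i]:
        nxt = wcet[i] + sum(math.ceil(t / period[k]) * wcet[k] for k in range(i))
        if nxt == t:
            return t      # fixed point reached: this is R_i
        t = nxt
    return None


def schedulable(wcet, period):
    """A task set is schedulable if every task has R_i <= p_i."""
    return all(response_time(i, wcet, period) is not None
               for i in range(len(wcet)))
```

For example, with WCETs (1, 2, 3) and periods (4, 6, 12), the lowest priority task iterates through 3, 6, 7, 9, and 10 before converging at a response time of 10.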


6.2.3  Response  time  analysis  with  HWDSs 

Adding HWDSs splits the periodic tasks into two sets

• $\hat{\Gamma}$: the set of tasks using a HWDS

• $\bar{\Gamma}$: the set of tasks not using a HWDS

so $\Gamma = \hat{\Gamma} \cup \bar{\Gamma}$. HWDS assignment is the problem of choosing whether to
place $T_i$ in $\hat{\Gamma}$ or in $\bar{\Gamma}$ for every $i$.

Task response times depend on HWDS assignment. Each task's WCET is now

$$ e_i = \begin{cases} \hat{e}_i + x_i + \hat{c}_i + \max_{j>i} \hat{c}_j & \text{if } T_i \in \hat{\Gamma} \\ \bar{e}_i & \text{otherwise} \end{cases} $$

where

• $\hat{e}_i$ is the WCET of $T_i$ when the HWDS replaces DS operations

• $x_i$ is the cost of exceptions taken due to using a HWDS

• $\hat{c}_i$ is the maximum cost to context switch the HWDS for $T_i$

• $\bar{e}_i$ is the WCET of $T_i$ using a software-only DS

$x_i$ depends primarily on how many DS operations can cause exceptions during $p_i$ (i.e.,
during any job of $T_i$) and the time needed to handle the exceptions; because $x_i$ depends
on the HWDS implementation, no generic formula exists for it.

$\hat{c}_i$ depends on $\hat{c}_j$ for $j > i$, that is, the maximum time needed to empty and fill
the HWDS of a lower priority task. Preempting a lower priority task $j$ empties $j$'s HWDS
and fills $i$'s, whereas resuming $j$ empties $i$'s HWDS and fills $j$'s.

Equation 6.1 still gives $R_i$, but now $e_i$ depends on whether $T_i \in \hat{\Gamma}$ or not; that
is, on the assignment algorithm. Assignment for just one task depends on whether

$$ \bar{e}_i > \hat{e}_i + x_i. $$

Assuming that $x_i$ is bounded, finding the $T_i$ that maximizes

$$ \bar{e}_i - (\hat{e}_i + x_i) $$

gives the task that will benefit most from using the HWDS.

Including multiple tasks that share the HWDS complicates the assignment problem. In
particular, $\hat{c}_i$ varies depending on the cost of emptying and filling the HWDS (i.e., a
context switch), so, unlike with traditional response time analysis, a low priority task can
affect the response time of higher priority tasks. Conversely, higher priority tasks already
affect the response time of lower priority tasks. So putting any $T_i$ into $\hat{\Gamma}$
necessitates checking whether it negatively affects the rest of the tasks already in
$\hat{\Gamma}$ in order to find an optimal assignment (see Section 6.3).
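
The per-task WCET under a given assignment can be written as a short function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, tasks are indexed so that a larger index means a lower priority, and only lower-priority tasks that actually use the HWDS contribute to the maximum context switch term.

```python
def effective_wcet(i, in_hwds, e_hw, e_sw, x, c_hw):
    """WCET of task i under a given HWDS assignment.

    in_hwds[j] is True when task j is assigned to the HWDS; e_hw, e_sw, x,
    and c_hw hold the per-task HWDS WCET, software WCET, exception cost,
    and HWDS context switch cost, respectively.
    """
    if not in_hwds[i]:
        return e_sw[i]
    # Worst HWDS context switch among lower-priority tasks that use the HWDS
    # (filtering by in_hwds is an assumption of this sketch).
    lower = [c_hw[j] for j in range(i + 1, len(in_hwds)) if in_hwds[j]]
    return e_hw[i] + x[i] + c_hw[i] + max(lower, default=0)
```

The dependence on `lower` is what lets a low priority task change the WCET, and therefore the response time, of a higher priority task.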

6.2.4  Response  time  analysis  with  a  priority  queue  HWDS 

When using a priority queue HWDS, the costs of $x_i$ and $\hat{c}_i$ are upper-bounded as
follows.

Let $S$ be the size of the HWDS. Tuning the number of nodes that the overflow (underflow)
exception handler spills (fills) to be $w \le S/2$ guarantees that at most one overflow
(underflow) exception will occur for every $w$ priority queue operations (enqueues or
dequeues). Let $O_i$ be the maximum number of operations that can occur for any job of
$T_i$, and let $A(w)$ be the WCET of the overflow (underflow) algorithm to handle $w$ nodes.
Then

$$ x_i \le A(w) \cdot \lceil O_i / w \rceil. \tag{6.2} $$

When the context switch invokes the overflow routines to empty the HWDS and the
underflow routines to fill it, the bound on $\hat{c}_i$ depends on how much of the HWDS $T_i$
uses. Let $s_i \le S$ be the maximum usage of the HWDS by $T_i$. Then

$$ \hat{c}_i \le A(s_i) \cdot s_i. \tag{6.3} $$

For example, if $N_i$ is the maximum size of the priority queue (i.e., maximum number of
overflow nodes) then a binary heap implementation of the overflow nodes will have
$A(w) \approx w \cdot \log_2 N_i$ (approximating the WCET of the heap by its asymptotic
behavior). Then $x_i$ and $\hat{c}_i$ come directly from Equations 6.2 and 6.3 respectively. In
Section 6.4, software and hardware implementations of priority queues are measured for the
WCET of their enqueue and dequeue operations and, for HWDSs, save-context,
restore-context, spill, and fill. I evaluate HWDS assignment algorithms with those
measurements.
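
As a worked illustration of the exception bound in Equation 6.2 with the binary heap approximation, the sketch below evaluates the right-hand side in abstract units of heap steps rather than measured cycles. The function names are hypothetical, and the asymptotic approximation stands in for the measured handler WCETs used in the experiments.

```python
import math


def heap_handler_wcet(w, max_queue_size):
    """A(w) approximated as w * log2(N_i) for a binary-heap overflow structure."""
    return w * math.log2(max_queue_size)


def exception_bound(ops, w, max_queue_size):
    """Right-hand side of Equation 6.2: A(w) * ceil(O_i / w)."""
    return heap_handler_wcet(w, max_queue_size) * math.ceil(ops / w)
```

For instance, with 1000 operations per job, handlers that move 64 nodes, and a maximum queue size of 1024, the bound is 64 * 10 * 16 = 10240 heap steps.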

6.3  HWDS  Assignment  for  Real-time  Systems 

I  use  terminology  from  scheduling  to  describe  HWDS  assignment  for  real-time  systems — 
indeed  the  assignment  problem  is  similar  to  the  problem  of  task  scheduling.  An  assignment 
is  feasible  if  a  solution  to  Equation  6.1  can  be  found  for  every  task  (equivalent  to  finding 
a  feasible  schedule).  If  an  assignment  algorithm  exists  that  produces  a  feasible  assignment 
for  a  set  of  tasks,  then  those  tasks  are  schedulable.  An  assignment  algorithm  is  optimal  if 
it  always  produces  a  feasible  assignment  for  a  set  of  tasks  when  one  exists. 

I  evaluate  four  assignment  algorithms  for  HWDSs:  software-only  assignment  (SOA), 
hardware-only  assignment  (HOA),  priority-aware  assignment  (PAA),  and  context  switch 
cost-aware  assignment  (CSCAA).  The  first  two  algorithms  are  naive  and  represent  two 
extremes,  and  the  latter  two  are  greedy  algorithms  employing  different  heuristics  to  make 
choices  about  when  to  use  a  HWDS.  None  of  these  algorithms  is  optimal,  and  the  PAA 
and  CSCAA  algorithms  do  not  permit  tasks  to  change  their  priorities. 

Some aspects of these algorithms depend on data structure behavior, in particular
on the WCET of HWDS operations, exceptions, and context switches. A priority queue
HWDS has a bounded WCET if the maximum priority queue size, maximum number
of operations per period, and the HWDS size are bounded. In general these algorithms
will work for any HWDS that has bounded WCET based on the data structure size and
operations. If a HWDS requires more information to bound its WCET, then new algorithms
may be required. Future work should evaluate the difficulty of HWDS assignment and
whether  efficient 

The  SOA  algorithm  simply  assigns  every  task  to  use  a  software-implemented  DS:  the 
SOA  algorithm  ignores  the  HWDS. 

The  HOA  algorithm  assigns  every  task  to  use  the  largest  possible  HWDS.  Usually  the 
largest  available  HWDS  gives  the  best  performance  out  of  all  the  available  HWDS  sizes, 
but  not  always.  As  the  usage  of  the  HWDS  increases,  the  rate  of  exceptions  should  go 
down  assuming  that  the  work  done  during  the  exception  handler  increases.  However  the 
latency  of  the  exception  handlers  will  increase,  and  so  will  the  HWDS  context  switch 
due  to  needing  to  move  more  data.  For  small  numbers  of  operations  per  period,  the 
larger  HWDSs  underperform  smaller  HWDSs;  at  small  counts  of  data  structure  operations 
software  typically  performs  better  than  any  HWDS. 

The PAA algorithm (Algorithm 1) iterates through tasks from the lowest priority to
the  highest  priority  choosing  at  each  task  whether  to  use  the  HWDS  by  comparing  the 
WCET  of  the  software  implementation  with  the  WCET  of  the  HWDS.  This  algorithm 
tracks  the  maximum  HWDS  context  switch  of  the  tasks  that  it  has  assigned  to  the  HWDS 
so  that  it  can  compute  the  WCET  accurately  taking  into  account  the  context  switch  costs 
of  lower-priority  tasks.  Iterating  from  low  to  high  priorities  allows  the  algorithm  to  move 
in  one  direction.  The  reason  that  this  algorithm  is  not  optimal  is  that  higher-priority  tasks 
that  use  the  HWDS  have  a  WCET  that  depends  on  whether  (and  which)  lower-priority 
tasks  use  the  HWDS.  Because  the  algorithm  only  moves  in  one  direction,  it  does  not  allow 
for  re-evaluating  the  assignment  of  lower-priority  tasks,  and  therefore  can  miss  feasible 
assignments. 
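
The greedy structure of PAA can be sketched in a few lines. This is an illustrative rendering rather than the implementation evaluated in this chapter: the callbacks `hwds_wcet`, `swds_wcet`, and `hwds_switch` and their signatures are assumptions standing in for the measured WCET models, and `sizes` is an assumed list of candidate HWDS sizes.

```python
def priority_aware_assignment(tasks, sizes, hwds_wcet, swds_wcet, hwds_switch):
    """Greedy PAA sketch; tasks are ordered from highest to lowest priority.

    hwds_wcet(task, size, c_max), swds_wcet(task), and hwds_switch(task, size)
    are caller-supplied WCET models (hypothetical signatures).
    Returns (uses_hwds, uses_software, chosen_size).
    """
    uses_hwds, uses_software, chosen_size = [], [], {}
    c_max = 0  # worst HWDS context switch among lower-priority HWDS users
    for task in reversed(tasks):                      # lowest priority first
        # Pick the HWDS size that minimizes this task's HWDS WCET.
        best = min(sizes, key=lambda s: hwds_wcet(task, s, c_max))
        if hwds_wcet(task, best, c_max) < swds_wcet(task):
            uses_hwds.append(task)
            chosen_size[task] = best
            c_max = max(c_max, hwds_switch(task, best))
        else:
            uses_software.append(task)
    return uses_hwds, uses_software, chosen_size
```

Because `c_max` only grows as the loop moves toward higher priorities, earlier decisions are never revisited, which is the one-directional behavior that makes the algorithm non-optimal.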

CSCAA  (Algorithm  2)  is  similar  to  PAA  except  for  the  cost  heuristic  that  gets  added 
to the HWDS WCET. The cost heuristic penalizes low-priority tasks for using the HWDS.

Algorithm 1: Priority-Aware Assignment (PAA)

Input: $n$: number of tasks, $\Gamma$: task set, $N$: max DS sizes, $O$: max DS operations, $S$: max HWDS size

    $\hat{\Gamma} = \emptyset$
    $\bar{\Gamma} = \emptyset$
    $\hat{c}_m = 0$
    for $i$ from $n$ to 0 do
        $\hat{e}_i$ = get_hwds_wcet($N_i$, $O_i$, $S$, $\hat{c}_m$)
        $s_i = S$
        for $s < S$ do
            $e$ = get_hwds_wcet($N_i$, $O_i$, $s$, $\hat{c}_m$)
            if $e < \hat{e}_i$ then
                $\hat{e}_i = e$
                $s_i = s$
            end
        end
        $\bar{e}_i$ = get_swds_wcet($N_i$, $O_i$)
        if $\hat{e}_i < \bar{e}_i$ then
            add_to_set($\hat{\Gamma}$, $T_i$)
            if $\hat{c}_i > \hat{c}_m$ then
                $\hat{c}_m = \hat{c}_i$
            end
        else
            add_to_set($\bar{\Gamma}$, $T_i$)
        end
    end
    return $\hat{\Gamma}$, $\bar{\Gamma}$


This heuristic tries to offset the effect of lower-priority tasks on higher-priority tasks. In
particular, the WCET of high-priority tasks affects low-priority task response times, so
reducing high-priority task WCETs should benefit response times for a set of tasks. Of
course, the penalty may prevent low-priority tasks from using the HWDS when they could
(and should), so this algorithm can miss feasible assignments. The cost heuristic can be
any function that gives a penalty to a task that, if it uses the HWDS, would increase
the maximum HWDS context switch time compared to tasks with a lower priority. For
this work, I used a cost heuristic that multiplies the amount by which a task would
increase the maximum HWDS context switch latency by the number of tasks with a higher
priority:
In Algorithm 2 the function get_cost returns $(\hat{c}_i - \hat{c}_m) \cdot (n - i)$ or 0, whichever is greater.

Algorithm 2: Context Switch Cost-Aware Assignment (CSCAA)

Input: $n$: number of tasks, $\Gamma$: task set, $N$: max DS sizes, $O$: max DS operations, $S$: max HWDS size

    $\hat{\Gamma} = \emptyset$
    $\bar{\Gamma} = \emptyset$
    $\hat{c}_m = 0$
    for $i$ from $n$ to 0 do
        $\hat{e}_i$ = get_hwds_wcet($N_i$, $O_i$, $S$, $\hat{c}_m$)
        $s_i = S$
        for $s < S$ do
            $e$ = get_hwds_wcet($N_i$, $O_i$, $s$, $\hat{c}_m$)
            if $e < \hat{e}_i$ then
                $\hat{e}_i = e$
                $s_i = s$
            end
        end
        $\bar{e}_i$ = get_swds_wcet($N_i$, $O_i$)
        if $\hat{e}_i$ + get_cost() $< \bar{e}_i$ then
            add_to_set($\hat{\Gamma}$, $T_i$)
            if $\hat{c}_i > \hat{c}_m$ then
                $\hat{c}_m = \hat{c}_i$
            end
        else
            add_to_set($\bar{\Gamma}$, $T_i$)
        end
    end
    return $\hat{\Gamma}$, $\bar{\Gamma}$
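
The cost heuristic described for CSCAA is small enough to state directly. In the sketch below the explicit parameter list of `get_cost` is an assumption, since the pseudocode does not show its arguments, and `cscaa_prefers_hwds` is a hypothetical helper that mirrors the comparison made in Algorithm 2.

```python
def get_cost(c_i, c_max, n, i):
    """Penalty: the increase in the maximum HWDS context switch latency
    caused by task i, scaled by (n - i), and never negative."""
    return max((c_i - c_max) * (n - i), 0)


def cscaa_prefers_hwds(e_hw, e_sw, c_i, c_max, n, i):
    """CSCAA test: use the HWDS only if its WCET plus the penalty beats software."""
    return e_hw + get_cost(c_i, c_max, n, i) < e_sw
```

A task whose HWDS context switch cost does not exceed the current maximum pays no penalty, so in that case CSCAA makes the same choice as PAA.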


6.4  Experiments 

I  conducted  a  series  of  experiments  to  evaluate  HWDSs  in  the  context  of  hard  real-time 
systems.  These  experiments  use  a  priority  queue  HWDS,  synthetic  task  sets  to  explore 
the  parameter  space  of  the  HWDS  as  the  parameters  relate  to  WCET,  and  workloads  that 
approximate  real-world  applications.  Experiments  are  conducted  using  the  experimental 
infrastructure  described  in  Section  3.3. 

I  measured  values  for  WCET  parameters  that  underlie  all  of  the  following  experiments. 
To  estimate  the  WCET  of  priority  queue  operations  I  implemented  an  implicit  binary  heap 
as  a  representative  software  priority  queue.  I  designed  a  series  of  measurement  tests  that 
build  a  priority  queue  up  to  a  specified  size,  and  then  measure  the  cost  of  an  operation 
at that size. Five specific events are measured in isolation: enqueue, dequeue, overflow
exception, underflow exception, and HWDS context switch. The latter three are relevant
to, and measured for, only a HWDS. All caching is disabled to obtain the WCET of these
five events. Although measurements taken without caching are pessimistic, the lack of a
time-predictable cache is problematic. As a result, memory access latency dominates the
WCET measurements.

To  force  the  worst-case  conditions  for  the  software  priority  queue,  measure  an  enqueue 
of  a  node  with  priority  less  than  the  highest-priority  node  in  the  heap  so  that  the  enqueue 
must  move  the  new  node  to  the  top  of  the  heap  resulting  in  a  maximum  number  of  swaps 
(equal  to  the  log  base-2  of  the  priority  queue  size).  A  dequeue  of  the  minimum  value  causes 
a  maximum  amount  of  work  in  a  heap. 
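
The worst-case enqueue can be reproduced with a small implicit binary heap. This sketch is not the measured implementation; it assumes a min-heap stored in a Python list in which a smaller value means a higher priority, and it counts swaps instead of measuring time.

```python
import math


def enqueue_swaps(heap, priority):
    """Insert into an implicit binary min-heap (a Python list) and return
    the number of swaps performed while sifting the new node up."""
    heap.append(priority)
    child = len(heap) - 1
    swaps = 0
    while child > 0:
        parent = (child - 1) // 2
        if heap[parent] <= heap[child]:
            break
        heap[parent], heap[child] = heap[child], heap[parent]
        child = parent
        swaps += 1
    return swaps
```

Enqueueing a new minimum into a heap that already holds 1023 nodes moves the node from the bottom level to the root, which takes `math.floor(math.log2(1024))` swaps, that is, 10.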

For  the  HWDS  enqueue  and  dequeue  WCET,  the  HWDS  must  be  in  a  state  that  will 
not  cause  an  exception.  Before  measuring  enqueue,  ensure  the  HWDS  has  enough  spare 
capacity  to  accept  the  new  node,  and  before  measuring  dequeue  ensure  at  least  one  valid 
node  is  at  the  head  of  the  queue.  To  generate  the  WCET  overflow,  the  nodes  that  get 
spilled  must  cause  the  spill  algorithm  to  do  maximum  work.  Using  the  united  HWDS 
described  in  Section  4.2,  spilling  iterates  from  the  tail  of  the  linked  list  to  the  head  (which 
has  highest  priority);  to  cause  the  WCET  overflow,  empty  the  HWDS  and  then  fill  it  with 
new  nodes  that  have  priority  less  than  the  head  of  the  overflow  linked  list,  thus  ensuring 
that  the  spill  algorithm  iterates  through  the  entire  linked  list  before  completing. 

The underflow handler has a special condition under which it has to spill nodes: when
the HWDS is full of marked nodes, it must fill from the spilled nodes and also spill some
of its marked nodes. The worst-case condition of an underflow is generated by enqueueing
nodes with priority less than the head of the spilled nodes (as with the overflow case),
marking all nodes in the HWDS, and then issuing a dequeue. The dequeue causes an
underflow, and the exception handler finds that no capacity exists to fill, so it spills nodes.
The spills will take maximum time because the handler spills nodes with higher priority
than the nodes already in the spill data structure. The underflow handler eventually fills
the HWDS.

To  cause  the  WCET  of  the  HWDS  context  switch,  fill  the  HWDS  to  its  maximum 
size  using  two  separate  data  structures  while  ensuring  the  HWDS  contains  nodes  with 
priority  less  than  the  head  of  the  spilled  nodes.  Then  cause  a  HWDS  context  switch  by 
issuing  an  operation  for  the  priority  queue  that  is  not  currently  in  the  HWDS  context.  The 
context  switch  handler  spills  all  of  the  nodes  in  the  HWDS,  which  (because  of  the  ordering 
of  nodes)  takes  maximum  time,  and  then  fills  the  HWDS  with  nodes  from  the  requested 
priority  queue’s  overflow  data  structure. 

6.4.1  Schedulability 

I designed a series of experiments using synthetic task sets to characterize the HWDS
parameter space and evaluate the HWDS assignment algorithms. A task set is started
by creating a set of $n$ tasks, choosing integer task periods $p_i$ uniformly from [1, 1000].
Choose task utilizations $u_i$ uniformly at random from [0.001, 1), implicitly selecting task
execution times $e_i$. After assigning all $n$ tasks a utilization, normalize each $u_i$ so that
$\sum_{i=0}^{n} u_i = U$, where $U$ is some target utilization value. This method of generating
tasks provides a variety of task sets while controlling the number of tasks and the task set
utilization. Use response time analysis (Equation 6.1) to ensure the generated task set is
schedulable, and regenerate any sets that fail the schedulability test. Then modify each
generated task set to include priority queue operations parametrized by a max priority
queue size, max HWDS size, priority queue implementation, and number of operations
to complete in a period. Using the task's period and utilization, calculate compute time
and add the WCET determined by the priority queue parameters. Priority queue size and
implementation determine the WCET for any operation, and the priority queue size with
the number of operations determines the WCET for the HWDS exceptions. The HWDS
and priority queue sizes determine the WCET for the HWDS context switch.
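
The first stage of this generation procedure, before the schedulability filter and the priority queue costs are applied, can be sketched as follows. The function name and the explicit `rng` parameter are assumptions for illustration; note also that `random.uniform` may return its upper endpoint in rare cases, whereas the text draws from a half-open interval.

```python
import random


def generate_task_set(n, target_utilization, rng):
    """Return a list of (period, wcet) pairs whose utilizations sum to the target."""
    periods = [rng.randint(1, 1000) for _ in range(n)]
    raw = [rng.uniform(0.001, 1.0) for _ in range(n)]
    scale = target_utilization / sum(raw)
    # Normalizing the utilizations implicitly fixes each execution time
    # as e_i = u_i * p_i.
    return [(p, u * scale * p) for p, u in zip(periods, raw)]
```

Passing a seeded generator such as `random.Random(1)` makes the generated task sets reproducible across runs.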

The parameters of max priority queue size, priority queue implementation, and number
of operations are varied in a controlled way. For each particular assignment of parameters,
generate 10000 task sets and attempt to assign priority queue usage for each task set
using all four of the algorithms (SOA, HOA, PAA, and CSCAA) presented in Section 6.3.
For each task set and assignment algorithm, determine whether the task set is schedulable
after priority queue assignment. For these experiments, I set the max HWDS size at 1024
and let PAA and CSCAA choose to limit individual tasks to a smaller size; in practice
these algorithms typically, but not always, use the largest possible HWDS size.


[Figure 6-1 panels: Software-Only, Hardware-Only, Priority-Aware, and Context Switch Cost-Aware. Plot title: Percent Schedulable with 90.0% Threshold (Utilization: 0.6, Tasks: 8, Max HWPQ Size: 1024). Axis label: PQ Ops per Period.]

Figure 6-1: Schedulability of random task sets for utilization (without priority queue
operations) fixed at 0.6 and task set size at 8. Varying utilization and the number of tasks
moves the threshold lines, which are shown in later figures.

Figure 6-1 shows the results as both the max priority queue size and the number of
priority  queue  operations  per  period  vary  by  powers  of  2  from  16  to  8192.  For  this  particular 
figure,  the  task  set  utilization  U  is  0.6  and  the  number  of  tasks  per  task  set  to  8.  The  plot 
shows  the  percent  of  task  sets  (out  of  10000)  that  are  schedulable  after  assignment  for  each 
combination  of  priority  queue  size  and  number  of  priority  queue  operations.  The  threshold 
line  plot  delineates  an  upper  limit  below  which  each  combination  feasibly  schedules  at 
least  90%  of  its  task  sets.  These  results  show  how  the  different  assignment  algorithms 
work,  and  in  particular  show  that  PA  A  dominates  SOA  and  HO  A  for  much  of  the  explored 
space.  The  threshold  line  also  shows  that  differences  exist  between  the  schedulability  of 
task  sets  assigned  using  PAA  versus  CSCAA,  with  neither  outperforming  the  other  for  all 
parameters  although  CSCAA  generally  does  better  than  PAA. 
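
One way to read the threshold line is sketched below in C++. This is only an illustration, not the plotting code used for Figure 6-1: the function name threshold_size and the rule that a size counts only if every smaller explored size also meets the cutoff are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For one fixed number of PQ operations per period, percent[i] is the percent
// of task sets found schedulable at priority queue size sizes[i] (ascending).
// Returns the largest size such that it and every smaller explored size meet
// the cutoff, or 0 when even the smallest explored size misses the cutoff.
inline unsigned threshold_size(const std::vector<unsigned>& sizes,
                               const std::vector<double>& percent,
                               double cutoff = 90.0) {
    assert(sizes.size() == percent.size());
    unsigned best = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        if (percent[i] < cutoff) {
            break;  // first size that falls below the cutoff ends the region
        }
        best = sizes[i];
    }
    return best;
}
```

Calling this function once per explored operation count would give one point of the threshold line per column of the plot.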

[Figure 6-2: two panels, each titled "Threshold for 90.0% Schedulability".
(a) Schedulability with U = 0.4. (b) Schedulability with U = 0.8.]

Figure 6-2: As utilization decreases (increases), threshold lines move up (down) because
applications have more (less) spare utilization to accommodate priority queue operations.

Figure 6-2a shows just the threshold lines, this time for a task set utilization U of 0.4 and
again with the number of tasks fixed at 8; Figure 6-2b shows how increasing U affects
schedulability by measuring schedulability with U at 0.8 and with 8 tasks. When system
utilization is low, the extra slack available in the system allows priority queue operations
to use more time, which leads to more task sets being schedulable. In general, the threshold
lines move up, indicating that for a given number of priority queue operations, the task sets
having priority queue sizes twice as large are schedulable over 90% of the time with the
extra 20% available CPU time.

[Figure 6-3: two panels, each titled "Threshold for 90.0% Schedulability".
(a) Schedulability with 4 tasks. (b) Schedulability with 16 tasks.]

Figure 6-3: As the number of tasks decreases (increases), the threshold lines move up
(down). Halving (doubling) the number of tasks more than doubles (halves) the number
of schedulable task sets.

Figure  6-3a  again  shows  the  threshold  lines,  this  time  with  U  at  0.6  and  with  4  tasks; 
Figure  6-3b  shows  how  increasing  the  number  of  tasks  with  fixed  U  affects  schedulability 
by  keeping  U  at  0.6  and  increasing  the  number  of  tasks  to  16.  The  extra  tasks  increase 
the  global  number  of  priority  queue  operations  (since  every  task  does  the  same  workload). 
Doubling the tasks has the effect of reducing by a factor of two the priority queue sizes of
task sets that are schedulable at least 90% of the time for a given number of operations
(two factors if compared to half as many tasks and 20% more CPU time).

6.4.2  Real-world  Applications 

The synthetic task sets demonstrate that priority queue HWDSs with the PAA and CSCAA
algorithms can decrease utilization and hence increase schedulability of applications that use
priority queues. This section shows how HWDSs might benefit real-world applications,
which may not exhibit behavior that is similar to the synthetic task sets. Two important
application domains in real-time and embedded systems are navigation and terrain mapping.
Both of these domains contain applications that use a priority queue as a central
data structure in their main algorithms. From the navigation domain, I use a version of
Dijkstra’s algorithm that is executed on real-world maps taken from the DIMACS shortest
path implementation challenge benchmarks [26]. From the terrain mapping domain, I use
an implementation of the grey-weighted distance transform that executes on a random 3D
image; this application has been used previously to evaluate a variety of software priority
queues [82]. I call these applications GPS and GWDT respectively. Both applications and
their inputs are available online; see [82, 26].

In order to simulate these real-world applications, I measured their behavior with respect
to PQ parameters that affect HWDS WCET. These measurements are the same as
those used in Section 4.3.2, where the methodology for taking measurements is explained
for the GPS application. Table 6-1 summarizes the measurements. For the GWDT application,
I included the peek, enqueue, and dequeue operations with priority queue memory
management; the software priority queue used for the measurements was the 4-heap [82].

Using  the  parameters  measured  from  running  the  applications,  I  modeled  two  new 
applications  that  simultaneously  run  x  numbers  of  small  (32  pixel)  GWDT  tasks,  y  numbers 
of  local  GPS  search  tasks,  1  large  (64  pixel)  image  processing  task,  1  regional  GPS  search 
task,  and  1  long-distance  GPS  search  task.  One  application  lets  x  vary  from  0  through 
12  with  y  fixed  at  1  (call  it  the  GWDT  application),  and  the  other  application  lets  y  vary 
from  0  through  12  with  x  fixed  at  1  (call  it  the  GPS  application).  The  total  number  of 
tasks  in  either  application  varies  from  4  to  16. 

Table 6-1: Priority queue behavior in real-world applications.

App.   Input         Priority Queue Size   Operations   Time
GWDT   32 pixels                   16303       168840   31.4%
       64 pixels                   56447      1353326   33.5%
GPS    NYC                           925       528693   28.5%
       S.F. Bay                      886       642540   27.1%
       Colorado                      945       871332   30.1%
       Florida                      1413      2140753   28.4%
       NW US                        1723      2415891   29.2%
       NE US                        1796      3048907   26.7%
       California                   2355      3781631   27.4%
       Great Lakes                  1810      5516239   27.9%
       Eastern US                   2336      7197247   24.6%
       Western US                   4281     12524209   24.3%
       Central US                   5086     28163632   22.4%

For each application at a given number of tasks, 10000 random task sets are generated
with the utilization drawn randomly as before (uniform in [0.001, 1] then normalized to
a target U after all tasks have a utilization), but now with the period determined by
the measured priority queue parameters. In particular, the WCET of a software priority
queue is determined (using measurements from the implicit binary heap) for the maximum
priority queue size and number of priority queue operations for the task, and the
percent of time the task should spend on the priority queue is used to determine how long
its total compute time should be. Then the task’s period is computed by dividing its total
compute time by its randomly generated utilization. Any task set that does not pass the
response time analysis is regenerated.
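
The arithmetic of this generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used for the experiments: the names derive_timing and normalize_utilizations are hypothetical, normalization is assumed to be a uniform rescaling so that the utilizations sum to the target U, and the priority queue fraction is assumed to correspond to the Time column of Table 6-1.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct TaskTiming {
    double compute_time;  // total compute time C of the task
    double period;        // period T = C / u
};

// pq_wcet:      WCET of the software priority queue work for the task
// pq_fraction:  fraction of the task's time spent on the priority queue
// utilization:  the task's (already normalized) utilization u
inline TaskTiming derive_timing(double pq_wcet, double pq_fraction,
                                double utilization) {
    assert(pq_wcet > 0.0);
    assert(pq_fraction > 0.0 && pq_fraction <= 1.0);
    assert(utilization > 0.0 && utilization <= 1.0);
    const double compute = pq_wcet / pq_fraction;  // C = pq_wcet / fraction
    return TaskTiming{compute, compute / utilization};
}

// Rescales randomly drawn utilizations so that they sum to target_u
// (assumed meaning of "normalized to a target U").
inline std::vector<double> normalize_utilizations(std::vector<double> u,
                                                  double target_u) {
    const double sum = std::accumulate(u.begin(), u.end(), 0.0);
    assert(sum > 0.0);
    for (double& x : u) {
        x *= target_u / sum;
    }
    return u;
}
```

For example, a task whose priority queue work has a WCET of 2 time units, takes 25% of its time, and has utilization 0.5 gets a total compute time of 8 and a period of 16 time units.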

The result of task set generation is a set of tasks that use a software priority queue and
whose task set has a utilization equal to a known value U. The software priority queue
WCETs are then removed from the tasks, and each assignment algorithm (SOA, HOA,
PAA, and CSCAA) is run on the task set. The SOA algorithm will result in a schedulable
task set with a utilization equal to U. Instead of using schedulability as the metric for
performance in these experiments, the amount the assignment algorithm improves (or
degrades) task set utilization is used; an improvement in utilization is a positive number,
so larger is better, and negative numbers indicate that the assignment algorithm does
worse than SOA.


[Figure 6-4: two panels, each titled "Difference in Utilization from SOA (0.7) for New York
City, Max HWPQ Size 1024". Horizontal axis: Number of Tasks (2 to 16); vertical axis
ticks from -0.2 to 0.15. (a) Utilization improvements for GPS. (b) Utilization improvements
for GWDT.]

Figure 6-4: Utilization improvements with increasing numbers of tasks executing local
search in NYC or small input for GPS and GWDT applications respectively.


Figure 6-4a shows how HOA and CSCAA improve utilization over SOA for the application
that varies the number of tasks running a local GPS search; each point is the
arithmetic mean of the difference between the utilization of SOA (fixed at 0.7) and one
of the assignment algorithms (either HOA or CSCAA) averaged across 10000 trials,
with error bars showing the sample standard deviation in both directions (one standard
deviation up and one down). The local GPS search is executing the benchmark challenge
for New York City, with the regional and long-range searches executing the northeastern
US and eastern US benchmarks respectively. Figure 6-4b shows the same measurements
but taken as the number of tasks running the small (32 pixels) GWDT increases.
The results for PAA are not shown because they overlap closely with those for
CSCAA. The gains for the GPS application are around 10-16% utilization, which represents
an improvement of 14-22% over the software PQ utilization.
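
The statistics behind each plotted point can be sketched as below. This is a hedged illustration rather than the analysis script used for Figure 6-4; the names summarize and relative_gain are hypothetical, and each input value is assumed to be the SOA utilization minus the utilization under the other assignment algorithm for one trial.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary {
    double mean;    // arithmetic mean of the utilization differences
    double stddev;  // sample standard deviation (n - 1 in the denominator)
};

// diffs[i] = (SOA utilization) - (utilization under HOA or CSCAA) for trial i.
inline Summary summarize(const std::vector<double>& diffs) {
    assert(diffs.size() >= 2);
    double sum = 0.0;
    for (double d : diffs) {
        sum += d;
    }
    const double mean = sum / static_cast<double>(diffs.size());
    double squares = 0.0;
    for (double d : diffs) {
        squares += (d - mean) * (d - mean);
    }
    const double n_minus_1 = static_cast<double>(diffs.size() - 1);
    return Summary{mean, std::sqrt(squares / n_minus_1)};
}

// Expresses an absolute utilization gain relative to the SOA utilization.
inline double relative_gain(double gain, double soa_utilization) {
    assert(soa_utilization > 0.0);
    return gain / soa_utilization;
}
```

With the SOA utilization fixed at 0.7, an absolute gain of 0.10 corresponds to a relative gain of about 14%, which is how the absolute and relative ranges quoted above relate to each other.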

The real-world applications demonstrate some interesting results. First is that just
using a HWDS (HOA) yields large swings in utilization; the smallest GWDT task has
a standard deviation of around 7% utilization. Second is that for some applications the
benefit of using HWDS may actually increase as the number of tasks increases; conversely
the benefits may decrease, as shown by the GWDT results. Even so, the CSCAA algorithm
produces useful HWDS assignments in these real-world task sets and improves task set
utilization, which enables real-time developers to schedule more hard real-time tasks. The
extra utilization also could be useful for admission control of sporadic and aperiodic tasks.


6.5  Summary 

This  chapter  demonstrated  that  HWDSs  can  benefit  real-time  systems  by  reducing  WCETs 
even  when  data  structure  sizes  exceed  the  size  of  the  HWDS.  Systems  software  support 
provides  flexibility  to  remove  size  and  sharing  limitations  of  hardware  so  that  applications 
can  benefit  from  using  HWDSs.  I  devised  two  new  algorithms  that  assign  tasks  to  use  either 
a  HWDS  or  a  software-implemented  data  structure,  and  experimental  results  show  those 
algorithms  outperform  just  using  the  software  or  just  using  the  HWDS  for  much  of  the 
explored  application  and  parameter  space.  A  priority  queue  HWDS  shows  how  real-world 
applications  for  navigation  and  image  processing  could  obtain  practical  improvements  in 
the range of 5-15% of total utilization using the intelligent approaches to overflow handling
and HWDS assignment proposed in this thesis.


Chapter 7: Future Work and Conclusion


Before I conclude this dissertation, this chapter identifies some of the possible
directions for future work.

7.1  Policies  for  Accessing  Memory 

Decades of research on caching have explored policies to improve performance:
penalty-reducing algorithms like critical-word first, early restart, read prioritization, and write
merging; miss-reducing techniques like hit under miss, hardware prefetching, and cache
pinning; reducing access latency with virtual addressing, cache sizing, and pipelining; and
eviction algorithms like LRU, least frequently used, and victim caching. Parallels to these
improvements may exist for HWDSs, since they too provide an interface to memory. The
solutions likely differ because, as demonstrated by the overflow handling for the priority
queue united HWDS, overflow data structure implementation may affect policy and
algorithm performance.

7.2  HWDS  Assignment 

This  thesis  shows  that  assignment  algorithms  make  a  difference.  Better  algorithms,  both 
static  (offline)  and  dynamic  (online),  certainly  exist  for  HWDS  assignment,  and  evaluating 
their  complexity  and  effectiveness  is  a  promising  avenue  for  future  research.  An  important 
open question related to HWDS assignment is what size HWDS a data structure of
a given size should be assigned.

7.3 Data Sharing


Nothing prevents tasks from sharing a HWDS with the same data structure. However, such
sharing imposes two new requirements on the HWDS: synchronization and protection. The
synchronization  and  protection  of  shared  data  are  well-studied  problems.  Synchronization 
is  solved  with  mechanisms  such  as  locking  and  TM.  Protection  is  provided  by  OS  support 
for  private  address  spaces  and  shareable  regions  within  those  spaces;  for  example,  shared 
pages  in  a  page-based  VMA  space.  This  thesis  supports  only  task-private  data  structures, 
so  synchronization  is  not  a  problem,  and  protection  is  implicit. 

HWDSs may prove beneficial for data sharing, because the hardware could implement
its own synchronization primitives. Protection does not seem problematic, because task
context can include HWDS context, in which case the OS only needs to know which HWDS
contexts a task may access. A deeper study of data sharing is needed to test these
hypotheses, but the idea seems promising.

7.4  OS  Optimizations  for  HWDSs 

HWDS  exceptions  and  hardware  performance  counters  offer  new  knowledge  about  data 
structure  usage  that  the  OS  might  use  advantageously.  Some  optimizations  to  investigate 
include  deferring  exception  handlers,  lazy  HWDS  context  switching,  co-scheduling  tasks 
that  share  a  data  structure,  avoiding  preemption  for  a  minimum  time  after  a  HWDS  context 
switch,  pinning  data  structures  to  the  HWDS  for  high  priority  tasks,  prefetching  HWDS 
context  for  tasks  near  the  front  of  the  scheduler’s  ready  queue,  and  using  the  overflow  data 
structure when it is already in cache. All of these optimizations have potential to improve
the performance of multitasking systems using HWDSs.

7.5 Integration with Programming Languages and Libraries


From a programmer’s perspective, libraries (like the STL) and the OS could use HWDSs
independent of applications to replace the use of data structures. Compilers can play a role
in effective HWDS use as well. Code generation and optimization for HWDS instructions
is an open area of research.

Object-oriented  languages  support  abstractions  and  libraries  for  the  data  structures  and 
operations  used  throughout  this  thesis:  the  C++  STL  provides  priority  queue  and  map 
containers — a  container  is  the  STL  equivalent  of  an  abstract  data  type.  STL  containers 
implement  data  structure  operations  as  member  functions  of  the  container  template  class. 
Some  of  these  functions  already  are  supported  by  HWDSs  to  a  limited  extent — for  example 
insert,  erase,  find,  begin,  push_front,  and  pop_front.  Other  functions  can  be  provided  with 
trivial  hardware  modifications,  for  example  hardware  counters  can  implement  functions 
related  to  capacity  such  as  size,  max_size,  and  empty.  The  difficulty  in  providing  the 
remaining  functions  is  that  either  the  hardware  support  is  non-trivial  or  the  nodes’  values 
must  be  accessed,  which  requires  knowledge  about  the  object  layout.  Also  unclear  is  how 
to  handle  iterators,  which  permit  applications  to  retain  handles  to  the  container.  Is  the  gap 
between  the  HWDS  interface  and  common  library  interfaces  such  as  the  STL  bridgeable? 
Is  a  light-weight  portability  interface  that  can  span  multiple  languages  and  libraries  to 
support  HWDSs  a  feasible  and  practical  solution  for  widespread  deployment?  Do  software 
library  interfaces  reduce  or  increase  the  appeal  of  the  HWDS  as  an  abstraction  layer? 
These  questions  open  new  directions  to  explore  library  and  language  support  for  HWDSs. 
Appendix A describes some initial steps along those directions.

7.6 Hardware Improvements

This  thesis  focuses  on  the  OS  side  of  the  hardware-software  interface  of  HWDSs,  and 
how  software  improvements  along  this  interface  help  applications  to  use  HWDSs  better. 
Investigations  along  the  hardware  side  of  the  interface  may  yield  benefits  for  applications 
as  well;  this  section  identifies  possible  directions  to  investigate. 

7.6.1  Other  HWDSs 

The  most  obvious  hardware  improvement  is  support  for  more  data  structures.  What  other 
data  structures  are  amenable  to  HWDS  implementation?  Kim  [67]  identifies  the  sparse 
vector  and  hash  table  as  possibilities  for  abstract  datatype  processors,  which  are  closely 
related  to  HWDSs.  Graphs  [85,  41]  and  trees  [117]  have  been  implemented  using  RC 
co-processors. 

7.6.2  Improved  processor  pipeline  support 

This  thesis  uses  a  HWDS  functional  unit  that  operates  atomically  and  non-speculatively. 
Since  the  rate  of  HWDS  instructions  usually  is  slow,  these  restrictions  are  not  oppressive. 
However,  some  heavy  uses  of  HWDSs  could  be  more  efficient  if  the  functional  unit  were  able 
to  operate  in  parallel  and  speculatively  with  the  rest  of  the  pipeline,  and  if  the  functional 
unit  itself  could  be  pipelined.  An  example  of  such  a  use  is  in  overflow  handling  and  context 
switching,  which  execute  repeated  spill  or  fill  instructions. 

7.6.3  HWDS  support  for  instructions 

The HWDSs presented in this thesis implement the spill and fill instructions with
combinations of other instructions, real and imagined. What if the HWDS supported spill
and fill natively? The complexity of OS management would lessen, since the interface to
the HWDS would be cleaner, albeit slightly larger. More important, the hardware would
be responsible for providing the most efficient mechanisms for getting data in and out.
One step further, perhaps the hardware can implement spill and fill to access memory
directly. Then the processor can be freed to do other work until the HWDS is finished.
Such support would permit HWDS events such as context switching and overflow handling
to work asynchronously and hide much of the overhead induced by those events.

Another  intriguing  possibility  for  HWDS  instructions  is  for  the  hardware  to  convert 
failover  operations  directly  into  overflow  data  structure  operations.  Such  conversion  would 
permit  failover  to  happen  asynchronously,  permitting  the  processor  or  HWDS  to  do  other 
work.  Such  a  hardware  improvement  is  reminiscent  of  stored  microprograms  [75,  86]  and 
instruction  fusion  [29]. 

Yet  another  mechanism  for  efficient  filling  would  be  to  use  the  idea  of  paired  operations 
proposed  by  Leiserson  [77].  An  example  of  a  paired  operation  is  a  [dequeue,  enqueue];  when 
overflow  exists,  a  dequeue  can  be  paired  implicitly  with  an  enqueue  from  the  overflow 
nodes.  In  a  priority  queue  that  only  reads  from  the  head  of  the  queue,  such  a  paired 
operation  has  the  potential  to  eliminate  underflow. 
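
The effect of pairing can be illustrated with a small software model. The C++ sketch below is not the HWDS design from this thesis; the class name PairedQueue, the integer keys, and the choice of a sorted set for the bounded "hardware" part and a binary heap for the overflow nodes are all assumptions made for illustration. Every dequeue is implicitly paired with an enqueue of the smallest overflow key, so the bounded part cannot run empty while overflow nodes remain.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <queue>
#include <set>
#include <vector>

// Software model of a min-priority queue split into a bounded "hardware"
// part (hw_) and an unbounded software overflow part (overflow_).
// Invariant: every key in hw_ is <= every key in overflow_.
class PairedQueue {
public:
    explicit PairedQueue(std::size_t hw_capacity) : capacity_(hw_capacity) {
        assert(hw_capacity > 0);
    }

    void enqueue(int key) {
        if (!overflow_.empty() && key >= overflow_.top()) {
            overflow_.push(key);  // key sorts behind everything in hw_
            return;
        }
        hw_.insert(key);
        if (hw_.size() > capacity_) {  // spill the largest key to overflow
            auto last = std::prev(hw_.end());
            overflow_.push(*last);
            hw_.erase(last);
        }
    }

    // Paired operation [dequeue, enqueue]: remove the head, then refill the
    // bounded part from the overflow nodes if any exist.
    int dequeue() {
        assert(!hw_.empty());
        const int key = *hw_.begin();
        hw_.erase(hw_.begin());
        if (!overflow_.empty()) {
            hw_.insert(overflow_.top());
            overflow_.pop();
        }
        return key;
    }

    bool empty() const { return hw_.empty(); }
    std::size_t hw_size() const { return hw_.size(); }
    std::size_t overflow_size() const { return overflow_.size(); }

private:
    std::size_t capacity_;
    std::multiset<int> hw_;
    std::priority_queue<int, std::vector<int>, std::greater<int>> overflow_;
};
```

In this model the bounded part stays full whenever overflow nodes exist, so a dequeue never finds it empty unless the whole queue is empty, which is the sense in which pairing can eliminate underflow for head-only reads.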

7.6.4  Prefetching 

Prefetching  is  known  to  decrease  (and  sometimes  increase)  cache  miss  rates.  The  structural 
locality  embedded  in  a  HWDS  seems  perfect  for  implementing  a  linked  prefetcher.  The 
HWDSs  used  in  this  thesis  only  store  key-value  pairs;  in  practice,  values  are  likely  pointers 
to  structured  data  that  an  application  uses.  Prefetch  logic  could  load  the  data  pointed 
to  by  the  values  for  nodes  in  the  HWDS.  Evaluations  for  prefetching  support  necessarily 
must  consider  the  cost  of  prefetching,  which  is  increased  memory  bus  pressure  and  the 
possibility of increased miss rates due to cache evictions caused by prefetched cache lines.

7.6.5 Multicore considerations


This thesis considered the integration of HWDSs in a uniprocessor computer architecture.
Modern systems increasingly rely on multiprocessing, in particular chip multicore
multiprocessing, which warrants further investigations into how HWDSs should be accessed by
hardware. Problems similar to those of caching (coherency, scalability, sharing, and hierarchy)
may appear in such investigations. The combination of data sharing and multicore is
appealing to study with HWDSs; multicore processors increase contention on shared data,
and if a HWDS can manage contention better than alternatives such as locking and TM,
then the HWDS may have an even greater benefit to multicore than to single core
computers. Multicore warrants further study of HWDSs.

7.7  Conclusion 

This thesis ponders: How should computers access memory? Since memory latency
improves more slowly than bandwidth, which improves more slowly than processor speed,
memory accesses hamper computer system performance. Although caching alleviates some
of the latency problems, when the cache inevitably misses, performance suffers. Instead of
operating in terms of memory (cache) accesses, this thesis argues that computer architecture
and operating systems should cooperate to support programming with data structure
operations, the common coin of modern programming languages.


Appendix A: STL Profiling: Containers and Comparators


This  appendix  describes  profiling  the  STL  to  determine  how  applications  use  containers 
and  object  comparisons. 


A.1 Maps in the C++ STL

Linear and tree-based structures, which underlie STL containers, are commonly used by
programmers. Commonly used containers are the vector, list, set, and map; these
containers are included by at least half of 21 open-source C++ programs covering a range of
application domains including navigation, simulation, computer vision, video games,
document processing, databases, operating systems, and web browsers. Table A-1 shows how
many of the following programs include which STL header files: dimacs-sq, Dijkstra’s
algorithm with SmartQ [26]; Opal [83] and GEM5 [15, 5], processor simulators; Geant4 [9, 4],
physics particle simulator; FlightGear [3], flight simulator; Wesnoth [120], video game;
OpenCV [132], computer vision library; Boost [2], library for C++; MySQL [92], database
server; LibreOffice [122], office productivity suite; Doxygen, documentation generation;
Haiku [53], OS based on BeOS; ReactOS [103], OS based on Windows NT; Chromium [121],
web browser; povray, soplex, dealII, namd, xalancbmk, astar, and omnetpp, C++
benchmarks from SPEC CPU 2006 [116].

Table A-1: STL container use of 21 open-source C++ programs.

vector  list  set  multiset  map  bitset  unordered_set  unordered_map
    13    13   13         1   14       6              4              5

Digging deeper, I investigated the runtime behavior of two programs (Geant4 and
Chromium) that rely heavily on the STL containers. I modified the profile mode of the
GNU C++ Library [50], which is based on the Perflint [80] project. I added detailed
statistics for the map and vector containers including the minimum, maximum, and total
number of elements, and the time of insert, erase, and find operations. Geant4 and
Chromium are built and executed with and without profile mode support. For Geant4, the
advanced examples provided with the release are used as a workload. For Chromium, the
workload was just to start the browser (which loads a blank page by default) and then
close the browser interactively; this workload is subject to timing variations and is only
useful for descriptive empirical evidence.

Table A-2: STL map use profiling. Map count is the number of maps created by the
application. Find time is the percent of application execution spent executing find operations.
Max find is the size and find time for the map with the most time spent executing find.

                                     Map size                     Max find
Application            Map count     Mean      Max   Find time   Map size   % Time
Chromium                    1907     5.33     1228      12.39%        254    3.55%
Geant4:
  DNAPhysics                 126  3074.59    88680      23.43%         38   11.38%
  gammaray_telescope         750    73.87      466       8.78%        466    0.24%
  hadrontherapy              752    74.30      467       5.07%        467    0.15%
  human_phantom               72  3459.30    21683       5.08%        408    4.41%
  microbeam                  540   529.89    21683       3.37%        240    0.74%
  brachytherapy               67  3744.83    21683       3.09%        576    2.43%
  radioprotection            771   239.08    53256       2.69%        488    0.06%
  lAr_calorimeter            751    74.83      467       1.90%        467    0.07%
  medical_linac               55  4544.80    21683       1.02%        224    0.71%
  underground_physics         58  4297.58    21683       0.90%        312    0.80%

The  profiled  version  of  each  application  is  executed  to  obtain  measurements  for  the 
STL  container  operations,  and  the  unprofiled  version  is  executed  to  obtain  a  measure  for 
the  overall  workload  execution  time  without  the  overhead  of  profiling.  Table  A-2  shows 
the  results  of  these  runs  with  the  Chromium  and  Geant4  workloads. 

Chromium spends about 12% of its time executing find for the simple task of starting up
and shutting down, and the Geant4 DNAPhysics example spends over 23% of its execution
time on find and around 11% of its execution time searching through a map with only
38 elements. Such a map is an ideal candidate for hardware acceleration: small and
frequently accessed. Even in Chromium, a map of only 254 elements consumes roughly
3.5% of execution time.
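
The kind of per-map statistics reported in Table A-2 can be approximated with a simple wrapper. The C++ sketch below is a standalone illustration and is not the modified profile mode of the GNU C++ Library described above; the class name ProfiledMap and its member names are hypothetical, and wall-clock timing with std::chrono::steady_clock is an assumption about how find time could be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>

// Wraps std::map and records the maximum size reached, the number of find
// operations, and the accumulated time spent inside find.
template <typename Key, typename Value>
class ProfiledMap {
public:
    using Map = std::map<Key, Value>;
    using Clock = std::chrono::steady_clock;

    void insert(const Key& key, const Value& value) {
        map_.emplace(key, value);  // like std::map::insert, keeps existing keys
        max_size_ = std::max(max_size_, map_.size());
    }

    typename Map::const_iterator find(const Key& key) {
        const Clock::time_point start = Clock::now();
        typename Map::const_iterator it = map_.find(key);
        find_time_ += Clock::now() - start;
        ++find_count_;
        return it;
    }

    typename Map::const_iterator end() const { return map_.end(); }
    std::size_t max_size() const { return max_size_; }
    std::size_t find_count() const { return find_count_; }

    // Percent of a separately measured total execution time spent in find.
    double find_percent(Clock::duration total) const {
        assert(total.count() > 0);
        using Seconds = std::chrono::duration<double>;
        return 100.0 * Seconds(find_time_).count() / Seconds(total).count();
    }

private:
    Map map_;
    std::size_t max_size_ = 0;
    std::size_t find_count_ = 0;
    Clock::duration find_time_{0};
};
```

As in the methodology above, the total execution time passed to find_percent would come from a separate unprofiled run, so that the overhead of the timing calls does not inflate the denominator.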

A.2 Object comparison code

Using HWDSs with objects is challenging for the STL set and map, because programmers
can write custom key and value comparison code, which can be hard to support with parallel
hardware comparators; performance benefits of HWDSs come especially from parallel
comparisons. Efficient comparison hardware exists for primitive data types including
integers, floats, and strings. If object-oriented programs use other data types, or complicated
combinations of these primitives, then HWDSs would have difficulty providing any benefits.
This section describes a cursory investigation into whether C++ programs use complicated
comparisons or simple, supported primitives.

To get a sense of whether C++ programs use complex comparisons with containers,
I extended the profile mode (used in Section A.1) with support for printing the call site
of container instantiation. The call site gives the code location where an object instance
is made, and whether it uses a primitive comparison (i.e., int, float, or string) or the
structure has some alternate comparison method. Using this profiling information, the
behavior of the heaviest usage of the STL map container in the two profiled applications
(Chromium and Geant4) can be found. The most heavily used maps in both applications
are maps that make straightforward comparisons with primitives (integer and double). The
other maps tend to use integer, floating point, or string comparisons.
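
The distinction between primitive and alternate comparisons can also be expressed at compile time. The following C++ sketch is an illustration only and is unrelated to the call-site profiling described above; the trait names is_primitive_key and uses_primitive_comparison are hypothetical, and treating arithmetic types and std::string compared with std::less as the "primitive" cases is an assumption that mirrors the int, float, and string categories in the text.

```cpp
#include <functional>
#include <map>
#include <string>
#include <type_traits>

// True for key types assumed to have efficient hardware comparators:
// integers, floating point values, and strings.
template <typename Key>
struct is_primitive_key
    : std::integral_constant<bool,
                             std::is_arithmetic<Key>::value ||
                                 std::is_same<Key, std::string>::value> {};

// True when a map type has a primitive key AND uses the default std::less
// ordering; any other comparator counts as an alternate comparison method.
template <typename Map>
struct uses_primitive_comparison
    : std::integral_constant<
          bool,
          is_primitive_key<typename Map::key_type>::value &&
              std::is_same<typename Map::key_compare,
                           std::less<typename Map::key_type>>::value> {};

// A map from int keys with the default ordering qualifies.
static_assert(uses_primitive_comparison<std::map<int, double>>::value,
              "int keys with std::less are a primitive comparison");
// The same key type with a non-default comparator is classified as alternate.
static_assert(
    !uses_primitive_comparison<std::map<int, int, std::greater<int>>>::value,
    "a non-default comparator is treated as an alternate comparison");
```

The classification is deliberately conservative: a comparator such as std::greater on integers would likely be easy to support in hardware, but the sketch flags everything other than the default ordering on a primitive key.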

The map that consumes the most time for the Chromium workload is the observers_
map declared in the NotificationServiceImpl class and used for event notification, in
particular for tracking observers of notifications. This map has integer keys with another
map (the event sources and observers) as its values; Figure A-1 shows the definition of the
map. The use of integer keys makes it viable for a map HWDS, as does its small maximum
size of 254 elements. The map that consumes the most time in the DNAPhysics example
of Geant4 uses double keys (and has another map as its value). By observation, the map
keys that are used in the Geant4 code base are integers, doubles, and strings; other maps
have key types that are obscured by classes and type definitions.

class NotificationServiceImpl
    : public content::NotificationService {

  typedef ObserverList<content::NotificationObserver> NotificationObserverList;
  typedef std::map<uintptr_t, NotificationObserverList*> NotificationSourceMap;
  typedef std::map<int, NotificationSourceMap> NotificationObserverMap;

  NotificationObserverMap observers_;

};


Figure A-1: The observers_ map consumes 3.5% of the Chromium workload’s execution
time and uses a primitive type (integers) for its key.


A.3 Summary


This  appendix  shows  that  real-world  applications  use  STL  maps  in  ways  that  are  amenable 
to  HWDS  support. 




